From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a New directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can have an overview of the folder in the output pane.

Set working directory

Load libraries

We use the ggplot2 package today. It is part of the tidyverse, so loading tidyverse also loads ggplot2 and we do not need to load it separately. If you do not use tidyverse, install and load ggplot2 on its own.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Story

As you remember, you left academia and decided to invest your time and money in a dog shelter. You received the dog records from the previous owner in a very poor format.

You did several rounds of data cleaning. Then, based on the cleaned data, you created content for your website, and you also created a linguistic corpus called Furry Friends Corpus.

Furthermore, since your dogs are getting more popular, you have received more funding and you managed to hire an assistant to do further analysis on your dogs’ lifestyle in order to improve their well-being. This is why the new dataset has a lot more columns, including data about their inherent activity level, behavior, color, diet type, etc.

Your goal for today: To help your website’s visitors, you have decided to add more visual information, such as plots, to the website.

Read in the data

Let’s bring the dog and corpus data into R.

# UD corpus created by udpipe
dog_ud <- read_tsv("data/dog_ud.tsv")
## Rows: 2162 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): doc_id, sentence, token, lemma, upos, xpos, feats, dep_rel, misc
## dbl (4): paragraph_id, sentence_id, token_id, head_token_id
## lgl (1): deps
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Extended dog data
dog_df <- read_tsv("data/enhanced_dog_dataset.tsv")
## Rows: 27 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): breed, name, sex, cage, height_vs_range, weight_vs_range, feed_ins...
## dbl (17): id, id2, weight_20, weight_21, weight_22, weight_23, height_low, h...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Visualization

Content of this session is mostly from the book “Modern Data Visualization with R” link, in some cases copy-pasted.

GGPLOT

ggplot2 is a data visualization package for R. The idea behind it is to describe a plot as a collection of independent components: data, geoms, aesthetics, scales, and so on. These components are layered on top of each other and chained together with the + symbol.

Components of a ggplot

There are different ways of looking at the data. For instance, you can look at the distributions of data from a single variable (univariate graphs), or the relation between two variables (bivariate graphs), or the relation among several variables (multivariate graphs).

Furthermore, our variables might be categorical (such as gender) or numeric/quantitative/continuous (such as age or salary).

Univariate graphs

Univariate graphs plot the distribution of data from a single variable. For instance, if we have a “smoking” column with values “smoker” and “non-smoker”, it plots how many of the observations in this column are smoker and how many are non-smoker. The variable we are plotting can be categorical (e.g., race, sex, political affiliation) or quantitative (e.g., age, weight, income).

Story: Since you’re super proud of the diversity in your shelter, like the different breeds, colors, sizes, and all, you want to show it off somehow. That’s why you’ve decided to make a graph that shows how many dogs of each color you have. You can use a bar chart for this purpose. However, before that, let’s use the table() function just to get an idea of what we have.

table(dog_df$dog_color)
## 
##   black   brown  golden spotted   white 
##       4       8       3       5       7

Creating a Bar chart

The first function in building a graph is the ggplot() function.

It specifies the data frame to be used and the mapping of the variables to the visual properties of the graph.

The mappings are placed within the aes function, which stands for aesthetics.

The first thing we have to do is to map the data we want to plot to an aesthetic of the graph. In this case, it is the data from the “dog_color” column of the dog_df data frame.

It is a univariate plot, because we are plotting the distribution of a single variable. Therefore, we only map one variable, and we map it to the x-axis.

ggplot(data = dog_df,
       mapping = aes(x = dog_color)) ## we use aes for mapping. Here you see the full syntax [mapping = aes()], but you can omit "mapping" and keep only the aes() function.

Why is the graph empty?

We specified that the “dog_color” variable should be mapped to the x-axis, but we haven’t yet specified what type of plot we want to have.
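
A quick way to convince yourself of this is to look inside the plot object. The sketch below is only an illustration: it uses the mpg data set that ships with ggplot2 instead of dog_df, the object names p_empty and p_bars are made up for the example, and it assumes the layers of a plot can be read with $layers.

```r
library(ggplot2)

# Data and mapping only: the axes are set up, but there is no layer to draw.
p_empty <- ggplot(mpg, aes(x = class))
length(p_empty$layers)   # number of layers before adding a geom

# Adding a geom adds one layer to the plot.
p_bars <- p_empty + geom_bar()
length(p_bars$layers)    # number of layers after adding geom_bar()
```

Printing p_empty shows only the empty panel, while printing p_bars shows the bars.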

GEOMS

Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are added using functions that start with geom_ such as geom_bar, geom_point, etc. For now, we want to put bars on the x-axis.

IMPORTANT: In ggplot2 graphs, functions are chained together using the plus + sign to build a final plot. Therefore, we can chain the ggplot function to geom_bar

ggplot(dog_df, aes(x = dog_color)) + 
  geom_bar()

Story: The graph has useful information, but it is not attractive enough for the website. You decide to make some changes to it to make it more appealing and informative to the visitors.

There are a couple of things we can change in this graph. For instance:

  • changing the color and transparency of the bars
  • changing the color of the bar borders
  • changing the graph title and axis labels
  • changing the width of the bars
  • changing the orientation of the bars (horizontal vs. vertical)

Check this comprehensive reference: link

colorLink

We start by changing the color (fill), transparency (alpha), border color (color), and width (width) of the bars. These changes apply to the bars, therefore they are added as parameters to the geom_bar function.

ggplot(dog_df, aes(x = dog_color)) + 
  geom_bar(fill = "purple",  alpha = .3, color = "black", width = .3)

# The bars are filled with purple color (fill = "purple"), semi-transparent (alpha = .3), with black borders (color = "black"), and have a width of 0.3 (width = .3).

As you can see, our plot has an ugly gray background. We can set different themes for our plot, for instance theme_minimal(). Themes are a set of visual design elements for a plot. They define the overall appearance of the plot, like background color, grid lines, and font styles.

ggplot(dog_df, aes(x = dog_color)) + 
  geom_bar(fill = "purple",  alpha = .3, color = "black", width = .3) + 
  theme_minimal()

We can also change the labels of the x and y axes, and give a title to our graph. We can use the function labs() to change the main title, subtitle, axis labels, caption, etc.

ggplot(dog_df, aes(x = dog_color)) + 
  geom_bar(fill = "purple", alpha = .3, color = "black", width = .3) +
  labs(x = "Color of the dogs",
       y = "Frequency", 
       title = "Color distribution of our dogs", 
       subtitle = "Bar plot", 
       caption = "This bar plot shows the color distribution of the dogs in our shelter. As can be seen, brown dogs are the most common ones in the shelter.")

# It is always good practice to add informative captions to your graphs. Don't be afraid of making them long.

Let’s look at another categorical variable with more categories.

Story: While you are working on the visualization of your website, you also work on the linguistic paper you are writing with your colleague Maria about the Furry Friends Corpus. For that paper, you want to know the distribution of the different part-of-speech (POS) tags in the dog descriptions provided in the corpus. The tags are stored in the “upos” column of the “dog_ud” dataframe. Let’s plot their distribution first:

ggplot(dog_ud, aes(x = upos)) + 
  geom_bar(fill = "lightseagreen")

Here, we got the count of POS tags from the upos column (x = upos) and plotted them.

Another way to plot: we first create a frequency dataframe (pos_freq) in which we store the frequency of each POS tag in a new column. Then, we map the POS tags from the upos column to the x-axis and the counts from the count column to the y-axis. Let’s do it (this way we can also practice a bit of group_by and summarise).

# In what follows, we group the data by the variable upos and count the number of rows in each upos

pos_freq <- dog_ud %>% 
  group_by(upos) %>%  
  summarise(count = n()) %>% 
  ungroup()
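
As a side note, dplyr also offers count() as a shorthand for this group-and-count pattern. The sketch below is not part of the original session; it uses a small made-up tibble (toy_ud) in place of dog_ud so that it runs on its own, and the names long_way and short_way are only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for dog_ud
toy_ud <- tibble(upos = c("NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"))

# The explicit version used in this session
long_way <- toy_ud %>%
  group_by(upos) %>%
  summarise(count = n()) %>%
  ungroup()

# The shorthand: count rows per upos and call the result column "count"
short_way <- toy_ud %>%
  count(upos, name = "count")

long_way
short_way
```

Both tables should contain the same rows, so you can use whichever version you find easier to read.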

Now, we can plot the data from the pos_freq dataframe. This time, for the mapping, we map the upos info to the x-axis, and the count info to the y-axis. Then, we tell our plot that the counts already come from a column and that it does not need to count them. The option stat = "identity" placed in geom_bar tells the plotting function not to calculate counts, because they are supplied directly.

ggplot(pos_freq, 
       aes(x = upos, y = count)) + 
  geom_bar(stat="identity", fill = "lightcoral", width = 0.5) + 
  labs(x = "POS tags", 
       y = "Frequency", 
       title  = "POS tag distributions in the Furry Friends corpus")
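
If you find stat = "identity" hard to remember, ggplot2 also provides geom_col(), which draws bars whose heights are taken directly from the y variable. The sketch below uses a small made-up data frame (toy_freq) rather than pos_freq, and compares the bar heights of both versions with layer_data().

```r
library(ggplot2)

# Hypothetical pre-counted data standing in for pos_freq
toy_freq <- data.frame(upos = c("ADJ", "NOUN", "VERB"), count = c(1, 3, 2))

# Version used in this session
p_identity <- ggplot(toy_freq, aes(x = upos, y = count)) +
  geom_bar(stat = "identity")

# Shorthand with geom_col()
p_col <- ggplot(toy_freq, aes(x = upos, y = count)) +
  geom_col()

# Bar heights computed for each plot
layer_data(p_identity)$y
layer_data(p_col)$y
```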

It looks good, but not too good. To improve the readability, we may want to:

  1. Order the values in an ascending order. For this, we can use the reorder function to sort the categories by the frequency (in ascending order).
ggplot(pos_freq, 
       aes(x = reorder(upos, count), y = count)) + 
  geom_bar(stat="identity", fill = "lightcoral", width = 0.5)

Any idea how to make it descending?

  2. Change the orientation of the graph (from vertical to horizontal). We use the function coord_flip for this purpose.
ggplot(pos_freq, 
       aes(x = reorder(upos, count), y = count)) + 
  geom_bar(stat="identity", fill = "lightcoral", width = 0.5) +
  coord_flip()

  3. Add the actual counts to the bar plots. With geom_text() we can add text directly to the plot.
ggplot(pos_freq, aes(x = reorder(upos, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  geom_text(aes(label = count), vjust = -0.3) # aes(label = count) means that the label text comes from the count column.
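
Coming back to the question of descending order: one possible answer is to reorder by the negative count. The sketch below uses a made-up data frame (toy_freq) instead of pos_freq, and first prints the factor levels so you can see the order before plotting.

```r
library(ggplot2)

# Hypothetical pre-counted data standing in for pos_freq
toy_freq <- data.frame(upos = c("NOUN", "VERB", "ADJ"), count = c(5, 2, 9))

# reorder() sorts the categories by the second argument
levels(reorder(toy_freq$upos, toy_freq$count))    # ascending by count
levels(reorder(toy_freq$upos, -toy_freq$count))   # descending by count

# The same trick inside aes()
p_desc <- ggplot(toy_freq, aes(x = reorder(upos, -count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral", width = 0.5)
p_desc
```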

And to make every category look more distinctive, let’s give each category a different color. Also, we should not forget to change the labels.

ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes (label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus" )

# fill = upos (inside aes, without quotes): This maps the fill color of the bars to the upos categories. Each unique upos value will be represented by a different color.
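
It helps to keep two cases apart: mapping fill to a variable inside aes(), and setting fill to one fixed color outside aes(). The sketch below illustrates the difference with a made-up data frame (toy_freq); the object names are only for the example.

```r
library(ggplot2)

# Hypothetical pre-counted data standing in for pos_freq
toy_freq <- data.frame(upos = c("ADJ", "NOUN", "VERB"), count = c(1, 3, 2))

# Mapping: fill depends on the data, so each category gets its own color
p_mapped <- ggplot(toy_freq, aes(x = upos, y = count, fill = upos)) +
  geom_bar(stat = "identity")

# Setting: fill is a constant, so all bars share one color
p_fixed <- ggplot(toy_freq, aes(x = upos, y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral")

length(unique(layer_data(p_mapped)$fill))  # how many different fill colors are used
unique(layer_data(p_fixed)$fill)           # the single fill color that was set
```

A mapped aesthetic also produces a legend, while a fixed one does not.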

Saving Graphs

Story: As mentioned earlier, we want to use the POS tag plot in our paper, which is written in a Word document. Therefore, we first need to save our plot as an image (e.g., png).

Graphs can be saved via the RStudio interface (less preferred) or through code (much more preferred).

To save the graph via RStudio, go to the Plots panel –> Export –> Save as Image. There, you can choose the right height and width for the image.

To save it via code, you can use the ggsave function.

ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes (label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus" )

ggsave(filename = "plot/test_pos_plot.png")
## Saving 7 x 5 in image
ggsave(filename = "plot/pos_plot.png",
       width = 3000, 
       height = 1200, units = "px")

Note: by default, ggsave saves the last plot you have created. However, you can specify a different plot by using the plot argument.

Any ggplot2 graph can be saved as an object. Then you can use the ggsave function to save the graph to disk. It is a good practice to save your plot as an object and then pass its name to ggsave.

posPlot <- ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes (label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus" )

ggsave(filename = "plot/pos_plot.png", plot = posPlot,
       width = 3000, 
       height = 1200, units = "px")
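
One practical detail: the folder you save into has to exist. The sketch below is an illustration under a few assumptions: it writes into a temporary folder instead of the project's plot folder, and demo_plot, out_dir, and out_file are made-up names. In your project you would use dir.create("plot", showWarnings = FALSE) instead.

```r
library(ggplot2)

# Create the output folder first (here inside a temporary directory)
out_dir <- file.path(tempdir(), "plot")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# A small hypothetical plot, stored as an object
demo_plot <- ggplot(data.frame(x = c("a", "b"), n = c(2, 3)), aes(x = x, y = n)) +
  geom_col()

# Pass the plot object explicitly and give the size in pixels
out_file <- file.path(out_dir, "demo_plot.png")
ggsave(filename = out_file, plot = demo_plot,
       width = 3000, height = 1200, units = "px")

file.exists(out_file)  # check that the image was written
```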

Creating a Histogram

Histograms are the most common approach to visualizing a quantitative (numeric) variable. In a histogram, the values of a variable are typically divided up into adjacent, equal-width ranges (called bins), and the number of observations in each bin is plotted with a vertical bar. For instance, if our dogs’ ages range from 0 (newborn) to 12 years, then we might decide to group them into 4 bins: 0-3, 3-6, 6-9, 9-12.

Story: You are deeply committed to the well-being of your dogs and ensure they receive regular daily exercise. You believe that by showcasing this practice to a broader audience, you can achieve two goals: (1) inspire other shelters to increase the frequency of exercise for their dogs, and (2) attract families who prioritize the well-being of their pets. Consequently, you have decided to plot the extent of your dogs’ exercise on a histogram and display it on your website.

# plot the exercise time distribution using a histogram

ggplot(dog_df, aes(x = daily_exercise_time_min)) + 
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs") 
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Two parameters are important in histograms: the number of bins (bins) and the width of each bin (binwidth). The bins parameter controls the number of bins into which the numeric variable is divided (i.e., the number of bars in the plot). The default is 30, but it is helpful to try smaller and larger numbers to get a better impression of the shape of the distribution.

ggplot(dog_df, aes(x = daily_exercise_time_min)) + 
  geom_histogram(fill = "blue", bins = 10, color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs")

Alternatively, you can specify the binwidth, the width of the bins represented by the bars.

ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black") + 
  labs(title = "Histogram of Daily Exercise Time",
       subtitle = "bin width = 10",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs")
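
To see what the two parameters do, you can inspect the bins that ggplot2 computes. The sketch below is only an illustration: it uses the built-in faithful data set (waiting times in minutes) instead of dog_df, and the object names are made up for the example.

```r
library(ggplot2)

# Same variable, once with a fixed number of bins, once with a fixed bin width
p_bins  <- ggplot(faithful, aes(x = waiting)) + geom_histogram(bins = 10)
p_width <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 10)

bins_data  <- layer_data(p_bins)
width_data <- layer_data(p_width)

nrow(bins_data)                          # how many bars bins = 10 produces
unique(width_data$xmax - width_data$xmin) # the width of each bar with binwidth = 10
sum(width_data$count)                    # every observation lands in some bin
```

With bins you decide how many bars there are and ggplot2 derives their width; with binwidth it is the other way around.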

The histogram shows a somewhat irregular distribution of daily exercise times (not a normal distribution which would be symmetrical, and not a skewed one which would have a tail on one side.)

The most common exercise time seems to fall between 25 and 50 minutes, as this bin has the highest count of dogs.

There is considerable variability in exercise times as the times range from less than 25 minutes to over 100 minutes. The bins representing the shortest (<25 minutes) and longest (>100 minutes) exercise times are less populated.

Before we move on, another thing you can change is the scale shown on the x-axis and y-axis using scales. Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping.

ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black") + 
  labs(title = "Histogram of Daily Exercise Time",
       subtitle = "bin width = 10",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs") +
  scale_x_continuous(breaks = seq(0,130,10)) 

# scale_x_continuous(breaks = seq(0,130,10)) controls where the tick marks on the continuous x-axis are placed: at every value of the sequence from 0 to 130 in steps of 10. This means you will see marks on the x-axis at 0, 10, 20, 30, 40, 50, 60, etc., as far as they fall within the plotted range.

## Can you change the scale for the y axis?
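
One possible answer, sketched on the built-in faithful data set rather than dog_df: the y-axis has its own scale function, scale_y_continuous(), which takes breaks in the same way. The break values below are chosen for this toy example only; for the dog data you would pick values that suit the counts you see.

```r
library(ggplot2)

p_y <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  scale_x_continuous(breaks = seq(40, 100, 10)) +  # tick marks on the x-axis
  scale_y_continuous(breaks = seq(0, 100, 20))     # tick marks on the y-axis
p_y
```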

Bivariate Graphs

Bivariate graphs display the relationship between two variables. So, they are good for answering the question “what is the relationship between A and B”. The type of the graphs depends on the measurement level of each variable (categorical or quantitative).

Categorical vs. Categorical variables

When plotting the relationship between two categorical variables, stacked, grouped, or segmented bar charts are typically used.

Stacked bar chart

Story: You care a lot about the prospective adopters of your dogs. You think that if you illustrate the connection between the dogs’ behavioral patterns and their activity levels, it will help future owners find a compatible pet by providing insights into the dogs’ general behavior and energy needs. For this reason, you decide to plot the relationship between the dogs’ behavior and their level of activity. You can use a bar plot in stacked mode for this purpose.

# stacked bar chart
ggplot(dog_df, aes(x = behavior , fill = activity_level)) + 
  geom_bar(position = "stack")

The graph is a stacked bar chart that displays the relationship between the behavior and activity level of dogs. There are five categories of behavior. Each behavior category has three associated activity levels: high, moderate, and low, represented by different colors (high in red, low in green, and moderate in blue).

For instance, we can see in this example that playful dogs belong only to the categories of high or moderate activity. This is quite an expected outcome. Also, we see that shy dogs predominantly have low activity.

Grouped bar chart

Grouped bar charts place bars for the second categorical variable side-by-side. To create a grouped bar plot, use the position = “dodge” option.

ggplot(dog_df, aes(x = behavior, fill = activity_level)) + 
  geom_bar(position = "dodge")

This is a grouped bar chart. The activity levels are color-coded and represented as “high” (red), “low” (green), and “moderate” (blue). Each behavior category has three bars corresponding to these activity levels, showing the count for each.

Segmented bar chart

A segmented bar plot is a stacked bar plot where each bar represents 100 percent. You can create a segmented bar chart using the position = “fill” option.

ggplot(dog_df, aes(x = behavior, fill = activity_level )) + 
  geom_bar(position = "fill")
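
To make clear what the segments in such a chart represent, the proportions can also be computed by hand. The sketch below uses a small made-up tibble (toy_dogs) instead of dog_df; the column names behavior and activity_level follow the session, everything else is only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for dog_df
toy_dogs <- tibble(
  behavior       = c("shy", "shy", "shy", "playful", "playful", "playful", "playful"),
  activity_level = c("low", "low", "moderate", "high", "high", "high", "moderate")
)

# Count each combination, then divide by the total of its behavior category
toy_props <- toy_dogs %>%
  count(behavior, activity_level) %>%
  group_by(behavior) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_props
```

Within each behavior category the proportions add up to 1, which is why every bar in a segmented chart has the same height.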

Categorical vs. Quantitative variables

This is one of the most common types of relationship to plot, and there are many different types of graphs for it. A box plot is a popular option. A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers.

Story: We want to explore the relationship between a dog’s activity level and its daily exercise time. This analysis helps us understand how the inherent activity level of dogs (categorized as low, moderate, or high) relates to the actual amount of exercise they get each day. It could tell us whether dogs with higher activity levels are indeed getting more exercise. If not, we should find a solution. We can use a box plot to visualize the distribution of daily exercise time (daily_exercise_time_min) across different activity levels (activity_level). We use geom_boxplot for that.

ggplot(dog_df, aes(x = activity_level, y = daily_exercise_time_min, fill = activity_level)) +
  geom_boxplot(alpha = .6) +
  labs(title = "Box Plot of Daily Exercise Time by Activity Level",
       x = "Activity Level",
       y = "Daily Exercise Time (minutes)") +
  theme_minimal()

The bottom and top of the boxes represent the first (Q1) and third (Q3) quartiles, respectively, indicating that the middle 50% of the data falls within this range. The horizontal line inside the box indicates the median, which is the middle value of the data. The whiskers extend from the box to the smallest and largest values that lie within 1.5 times the interquartile range (IQR = Q3 - Q1) of Q1 and Q3, i.e., no lower than Q1 - 1.5 * IQR and no higher than Q3 + 1.5 * IQR. Points outside this range are considered outliers and are represented as dots.
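
The numbers behind a box can be reproduced by hand. The sketch below uses a made-up vector of exercise times (not the real dog data) and base R's quantile() with its default definition, which is an assumption about how the quartiles are computed.

```r
# Hypothetical exercise times in minutes
exercise <- c(20, 30, 35, 40, 45, 50, 55, 60, 150)

q1  <- unname(quantile(exercise, 0.25))  # 35
q3  <- unname(quantile(exercise, 0.75))  # 55
iqr <- q3 - q1                           # 20

lower_fence <- q1 - 1.5 * iqr            # 5
upper_fence <- q3 + 1.5 * iqr            # 85

# Values beyond the fences are drawn as dots
outliers <- exercise[exercise < lower_fence | exercise > upper_fence]  # 150

# The whiskers end at the most extreme values inside the fences
whisker_ends <- range(exercise[exercise >= lower_fence & exercise <= upper_fence])  # 20 60
```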

Dogs with a high activity level have the widest IQR. The median exercise time is around 60 minutes. The IQR for the other two boxes is narrower, showing less variability in the exercise time they get. For the low activity group, we see this crazy outlier (shown by a dot), meaning that one of the dogs from the low activity level gets radically more exercise than the others. I think this is the owner’s favorite and the owner takes it with her everywhere :-D Furthermore, it seems the moderate and the low groups have similar medians. This could be a concern, as the moderate group perhaps needs more exercise than the low activity group.

Quantitative vs. Quantitative variables

The simplest display of two quantitative continuous variables is a scatterplot, with each variable represented on an axis.

Story: You have different assistants in the shelter and some take the dogs outside for a walk more than others. You have noticed some differences in the sleeping patterns of your dogs and you are not sure where they come from. One of your speculations is that the amount of daily exercise has a positive impact on the hours the dogs sleep, i.e., you expect the dogs who exercise more to also sleep more at night (since they are tired). We plot these trends using a scatterplot (geom_point). It displays points at the coordinates determined by the values of the x and y variables (in this case, daily exercise time and sleep hours). Each point on the plot represents one dog from the dataset.

ggplot(dog_df, aes(x = daily_exercise_time_min, y = sleep_amount_hours)) +
  geom_point() +
  labs(title = "Scatter Plot of Dog Sleeping Hours vs. Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Sleep Hours (hours)") +
  theme_minimal()

It is often useful to summarize the relationship displayed in the scatterplot, using a best fit line. Many types of lines are supported, including linear, polynomial, and nonparametric (loess).

ggplot(dog_df, aes(x = daily_exercise_time_min, y = sleep_amount_hours)) +
  geom_point() +
  labs(title = "Scatter Plot of Dog Sleeping Hours vs. Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Sleep Hours (hours)") +
  theme_minimal() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

# geom_smooth(method = "lm") adds a linear regression line to the scatter plot, showing the average linear relationship between daily exercise time and sleep hours in dogs. The line added to the plot represents the best-fit linear relationship between these two variables.
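
If you want the numbers behind the line (intercept and slope), you can fit the same model yourself with lm(). The sketch below is an illustration on the built-in cars data set (speed and stopping distance) rather than dog_df; it checks that the line drawn by geom_smooth matches the predictions of lm(dist ~ speed).

```r
library(ggplot2)

# Fit the linear model by hand
fit <- lm(dist ~ speed, data = cars)
coef(fit)  # intercept and slope of the best-fit line

# The same model drawn by geom_smooth
p_lm <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth layer: compare its line with the model's predictions
line_data <- layer_data(p_lm, 2)
from_lm   <- predict(fit, newdata = data.frame(speed = line_data$x))
all.equal(line_data$y, unname(from_lm))
```

A positive slope means that y tends to increase with x; for the dog data this would correspond to dogs that exercise more also sleeping more.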

Multivariate Graphs

Sometimes, we are interested in studying the relationship between several variables. Multivariate graphs display the relationships among three or more variables. There are two common methods for accommodating multiple variables: grouping and faceting.

Grouping

In grouping, the values of the first two variables are mapped to the x and y axes. Then additional variables are mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping allows you to plot the data for multiple groups in a single graph.

For grouping, we introduce the other variables through changes in color, shape, etc.

ggplot(dog_df, aes(x = daily_exercise_time_min, 
                   y = sleep_amount_hours,
                   color = activity_level)) +
  geom_point() +
  labs(title = "Dog Sleeping Hours by Daily Exercise Time and Activity Level",
       x = "Daily Exercise Time (minutes)",
       y = "Sleep Hours (hours)") +
  theme_minimal() 

Faceting

Grouping allows you to plot multiple variables in a single graph, using visual characteristics such as color, shape, and size. In faceting, a graph consists of several separate plots or small multiples, one for each level of a third variable, or combination of two variables.

ggplot(dog_df, aes(x = dog_color, fill = dog_color)) + 
  geom_bar() +
  facet_wrap(~sex, ncol = 1) +
  labs(title = "Bar plot of Dogs' Color by Sex",
       x = "Dogs' Color",
       y = "Count of Dogs") 

The facet_wrap function creates a separate graph for each level of sex. The ncol option controls the number of columns.

More fun themes

You can have different plot themes, such as barbie (or many others actually).

#remotes::install_github("MatthewBJane/ThemePark")

## https://github.com/MatthewBJane/ThemePark

library(ThemePark)
ggplot(dog_df, aes(x = daily_exercise_time_min, y = sleep_amount_hours)) +
  geom_point(color = "lightseagreen") +
  labs(title = "Scatter Plot of Dog Sleeping Hours vs. Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Sleep Hours (hours)") +
  geom_smooth(method = "lm", color = "purple")+
  theme_barbie()
## `geom_smooth()` using formula = 'y ~ x'

Advice/Best practices

link