Univariate graphs plot the distribution of data from a single
variable. For instance, if we have a “smoking” column with the values
“smoker” and “non-smoker”, a univariate graph shows how many of the observations in this
column are smokers and how many are non-smokers. The variable we are
plotting can be categorical (e.g., race, sex, political affiliation) or
quantitative (e.g., age, weight, income).
Creating a Bar chart
The first function in building a graph is the ggplot()
function.
It specifies the data frame to be used and the
mapping of the variables to the visual properties of
the graph.
The mappings are placed within the aes function, which
stands for aesthetics.
The first thing we have to do is to map our data to a ggplot
object. In this case, it is the data from the “dog_color” column of the
dog_df dataframe.
It is a univariate plot, because we are plotting the distribution of
a single variable. Therefore, we only map one variable, to the x-axis.
ggplot(data = dog_df,
       mapping = aes(x = dog_color))
## We use aes() for the mapping. Here you see the full syntax [mapping = aes()], but you can omit "mapping =" and keep only the aes() call.

Why is the graph empty?
We specified that the “dog_color” variable should be mapped to the
x-axis, but we haven’t yet specified what type of plot we want to
have.
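One way to see why nothing is drawn is to inspect the plot object itself. The sketch below uses a small made-up stand-in for dog_df (the values are assumptions for illustration, not the real shelter data) and counts the layers of the plot before and after a geom is added.

```r
library(ggplot2)

# Hypothetical stand-in for dog_df; the colors are made up for illustration.
dog_df <- data.frame(dog_color = c("brown", "black", "brown", "white"))

# Data and mapping only: no geom has been added yet.
p <- ggplot(dog_df, aes(x = dog_color))

# Count the layers before and after adding a geom.
n_before <- length(p$layers)
n_after <- length((p + geom_bar())$layers)
print(c(before = n_before, after = n_after))
```

A plot without layers only knows its data and axes, which is why we see an empty panel.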
GEOMS
Geoms are the geometric objects (points, lines, bars, etc.) that can
be placed on a graph. They are added using functions that start with
geom_ such as geom_bar,
geom_point, etc. For now, we want to draw bars along the
x-axis.
IMPORTANT: In ggplot2 graphs, functions are chained together using
the plus sign (+) to build a final plot. Therefore, we can
chain the ggplot function to geom_bar:
ggplot(dog_df, aes(x = dog_color)) +
  geom_bar()

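To make it clearer what geom_bar() does behind the scenes, here is a small sketch (again with a made-up stand-in for dog_df, which is an assumption for illustration). It compares the counts that ggplot2 computes for the bars with the counts we get from table().

```r
library(ggplot2)

# Hypothetical stand-in for dog_df; the colors are made up for illustration.
dog_df <- data.frame(
  dog_color = c("brown", "black", "brown", "white", "brown", "black")
)

p <- ggplot(dog_df, aes(x = dog_color)) +
  geom_bar()

# layer_data() returns the data computed for a layer.
# For geom_bar(), this includes a count column with the bar heights.
bar_counts <- layer_data(p)$count

# The same counts, computed by hand.
manual_counts <- as.vector(table(dog_df$dog_color))

print(bar_counts)
print(manual_counts)
```

In other words, geom_bar() counts the rows in each category for us, so we only need to map the x variable.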
Story: The graph has useful information, but it is
not attractive enough for the website. You decide to make some changes
to it to make it more appealing and informative to the visitors.
There are a couple of things we can change in this graph. For
instance:
- Changing the color and transparency of the bars
- Changing the color of the bar borders
- Changing the graph title and axis labels
- Changing the width of the bars
- Changing the orientation of the bars (horizontal vs. vertical)
We start by changing the fill color (fill), transparency
(alpha), border color (color), and width
(width) of the bars. These changes apply to the
bars; therefore, they are added as arguments to the geom_bar
function.
ggplot(dog_df, aes(x = dog_color)) +
  geom_bar(fill = "purple", alpha = .3, color = "black", width = .3)

# The bars are filled with purple color (fill = "purple"), semi-transparent (alpha = .3), with black borders (color = "black"), and have a width of 0.3 (width = .3).
As you can see, our plot has an ugly gray background. We can set
different themes for our plot, for instance
theme_minimal(). Themes are a set of visual design elements
for a plot. They define the overall appearance of the plot, like
background color, grid lines, and font styles.
ggplot(dog_df, aes(x = dog_color)) +
  geom_bar(fill = "purple", alpha = .3, color = "black", width = .3) +
  theme_minimal()

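If you want the same theme for every plot in a session, you do not have to add it to each plot separately. A minimal sketch, using a made-up stand-in for dog_df (an assumption for illustration): theme_set() changes the default theme and returns the previous one, so we can restore it later.

```r
library(ggplot2)

# Hypothetical stand-in for dog_df; the colors are made up for illustration.
dog_df <- data.frame(dog_color = c("brown", "black", "brown", "white"))

# Set theme_minimal() as the default theme and keep the previous default.
old_theme <- theme_set(theme_minimal())

# This plot now uses theme_minimal() without adding it explicitly.
p <- ggplot(dog_df, aes(x = dog_color)) +
  geom_bar(fill = "purple", alpha = .3, color = "black", width = .3)
print(p)

# Restore the previous default theme.
theme_set(old_theme)
```

Other built-in themes, such as theme_bw() or theme_classic(), can be used in the same way.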
We can also change the labels of the x- and y-axes and give a title
to our graph. We can use the labs() function to change the
main title, subtitle, axis labels, caption, etc.
ggplot(dog_df, aes(x = dog_color)) +
  geom_bar(fill = "purple", alpha = .3, color = "black", width = .3) +
  labs(x = "Color of the dogs",
       y = "Frequency",
       title = "Color distribution of our dogs",
       subtitle = "Bar plot",
       caption = "This bar plot shows the color distribution of the dogs in our shelter. As can be seen, brown dogs are the most common ones in the shelter.")

# It is always good practice to add informative captions to your graphs. Don't be afraid of making them long.
Let’s look at another categorical variable with more categories.
Story: While you are working on the visualization for
your website, you are also working on the linguistic paper you are writing with
your colleague Maria about the Furry Friends Corpus. For the paper, you are
interested in the distribution of the different part-of-speech (POS) tags in
the dog descriptions provided in the corpus. To find out, we can look at
the distribution of the tags in the “upos” column of the
“dog_ud” dataframe. Let’s plot it first:
ggplot(dog_ud, aes(x = upos)) +
  geom_bar(fill = "lightseagreen")

Here, we got the count of POS tags from the upos column (x = upos)
and plotted them.
Another way to plot: we first create a frequency dataframe (pos_freq)
in which we store the frequency of each POS tag in a new column. Then, we
map the POS tags from the upos column to the x-axis and the counts from
the count column to the y-axis. Let’s do it (this way, we can also practice a bit
of group_by and summarise).
# In what follows, we group the data by the variable upos and count the number of rows in each group.
# group_by(), summarise(), n(), and %>% come from dplyr (loaded with the tidyverse).
pos_freq <- dog_ud %>%
  group_by(upos) %>%
  summarise(count = n()) %>%
  ungroup()
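The group_by/summarise pattern is good practice, but dplyr also offers a shortcut for this particular case. The sketch below uses a tiny made-up stand-in for dog_ud (an assumption for illustration) and shows that count() with name = "count" gives the same frequency table.

```r
library(dplyr)

# Hypothetical stand-in for dog_ud; the tags are made up for illustration.
dog_ud <- data.frame(
  upos = c("NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB")
)

# The long form used in this tutorial.
pos_freq <- dog_ud %>%
  group_by(upos) %>%
  summarise(count = n()) %>%
  ungroup()

# The shortcut: count() groups, counts, and ungroups in one step.
pos_freq_short <- dog_ud %>%
  count(upos, name = "count")

print(pos_freq)
print(pos_freq_short)
```

Both versions return one row per POS tag with its frequency in the count column.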
Now, we can plot the data from the pos_freq dataframe. This time, for
the mapping, we map the upos info to the x-axis, and the count info to
the y-axis. Then, we tell our plot that the counts come from a
column and that it does not need to count them. The option
stat="identity" placed in geom_bar tells the
plotting function not to calculate counts, because they are supplied
directly.
ggplot(pos_freq,
       aes(x = upos, y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral", width = 0.5) +
  labs(x = "POS tags",
       y = "Frequency",
       title = "POS tag distributions in the Furry Friends corpus")

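ggplot2 also provides geom_col(), which draws bars from values that are already in the data, so it can stand in for geom_bar(stat = "identity"). A minimal sketch with a made-up pos_freq (the counts are assumptions for illustration) shows that both produce the same bar heights.

```r
library(ggplot2)

# Hypothetical stand-in for pos_freq; the counts are made up for illustration.
pos_freq <- data.frame(
  upos = c("ADJ", "NOUN", "VERB"),
  count = c(1, 3, 2)
)

# Version used in this tutorial.
p_identity <- ggplot(pos_freq, aes(x = upos, y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral", width = 0.5)

# Alternative: geom_col() uses the y values as bar heights directly.
p_col <- ggplot(pos_freq, aes(x = upos, y = count)) +
  geom_col(fill = "lightcoral", width = 0.5)

# Compare the bar heights computed for each plot.
heights_identity <- layer_data(p_identity)$y
heights_col <- layer_data(p_col)$y
print(heights_identity)
print(heights_col)
```

Which one you use is a matter of taste; the rest of this section keeps geom_bar(stat = "identity").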
It looks good, but not too good. To improve the readability, we may
want to:
- Order the values in an ascending order. For this, we can use the
reorder function to sort the categories by the frequency
(in ascending order).
ggplot(pos_freq,
       aes(x = reorder(upos, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral", width = 0.5)

Any idea how to make it descending?
- Change the orientation of the graph (from vertical to horizontal).
We use the function
coord_flip for this purpose.
ggplot(pos_freq,
       aes(x = reorder(upos, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral", width = 0.5) +
  coord_flip()

- Add the actual counts to the bars. With
geom_text(),
we can add text directly to the plot.
ggplot(pos_freq, aes(x = reorder(upos, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  geom_text(aes(label = count), vjust = -0.3) # The label values come from the count column; vjust = -0.3 places them just above the bars.

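Coming back to the question of descending order raised above, one possible approach is to negate the sorting variable inside reorder(). The sketch below uses a made-up pos_freq (the counts are assumptions for illustration).

```r
library(ggplot2)

# Hypothetical stand-in for pos_freq; the counts are made up for illustration.
pos_freq <- data.frame(
  upos = c("ADJ", "NOUN", "VERB"),
  count = c(1, 3, 2)
)

# reorder() sorts the categories in ascending order of its second argument.
# Negating the counts therefore puts the most frequent tag first.
desc_levels <- levels(reorder(pos_freq$upos, -pos_freq$count))
print(desc_levels)

ggplot(pos_freq, aes(x = reorder(upos, -count), y = count)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  geom_text(aes(label = count), vjust = -0.3)
```

Note that the minus sign only affects the order of the bars. The heights still come from the original count column.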
And to make every category look more distinctive, let’s give each
category a different color. Also, we should not forget to change the
labels.
ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus")

# fill = upos (inside aes(), without quotes): This maps the fill color of the bars to the upos categories. Each unique upos value is represented by a different color.
Saving Graphs
Story: As mentioned earlier, we want to use the
POS tag plot we created in our paper, which is written in a Word document.
Therefore, we first need to save our plot as an image (e.g., png).
Graphs can be saved via the RStudio interface (less preferred) or
through code (much more preferred).
To save the graph via RStudio, go to the Plots panel -> Export
-> Save as Image. There, you can choose the right height and width
for the image.
To save it via code, you can use the ggsave
function.
ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus")

ggsave(filename = "plot/test_pos_plot.png")
## Saving 7 x 5 in image
ggsave(filename = "plot/pos_plot.png",
       width = 3000,
       height = 1200, units = "px")
Note: by default, ggsave saves the last plot you have created.
However, you can specify a different plot by using the plot
argument.
Any ggplot2 graph can be saved as an object. Then you can use the
ggsave function to save the graph to disk. It is a good practice to save
your plot as an object and then pass its name to ggsave.
posPlot <- ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), vjust = -0.3) +
  labs(x = "POS tags", y = "Frequency", title = "POS tag frequency in Furry Friends Corpus")
# Pass the plot object explicitly through the plot argument.
ggsave(filename = "plot/pos_plot.png", plot = posPlot,
       width = 3000,
       height = 1200, units = "px")
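Two practical points are worth a short sketch: the target folder has to exist before we save into it, and the image size can also be given in inches together with a resolution (dpi). The example below is an illustration under assumptions: pos_freq is made up, and the output goes to a folder inside R's temporary directory rather than the "plot" folder used above. At 300 dpi, 10 x 4 inches corresponds to 3000 x 1200 pixels.

```r
library(ggplot2)

# Hypothetical stand-in for pos_freq; the counts are made up for illustration.
pos_freq <- data.frame(
  upos = c("ADJ", "NOUN", "VERB"),
  count = c(1, 3, 2)
)

posPlot <- ggplot(pos_freq, aes(x = reorder(upos, count), y = count, fill = upos)) +
  geom_bar(stat = "identity") +
  labs(x = "POS tags", y = "Frequency")

# Hypothetical output folder inside the session's temporary directory.
out_dir <- file.path(tempdir(), "plot")

# Create the folder if it does not exist yet.
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "pos_plot.png")

# 10 x 4 inches at 300 dpi gives an image of 3000 x 1200 pixels.
ggsave(filename = out_file, plot = posPlot,
       width = 10, height = 4, units = "in", dpi = 300)

print(file.exists(out_file))
```

In your own project, you would replace out_dir with the folder where you keep your figures.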
Creating a Histogram
Histograms are the most common approach to visualizing a quantitative
(numeric) variable. In a histogram, the values of a variable are
typically divided up into adjacent, equal width ranges (called bins),
and the number of observations in each bin is plotted with a vertical
bar. For instance, if our dogs’ ages range from 0 (newborn) to 12 years,
then we might decide to group them into 4 bins: 0-3, 3-6, 6-9,
9-12.
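To make the idea of binning concrete, here is a small sketch in base R. The ages are made up for illustration (an assumption, not the shelter data). cut() assigns each age to one of the four bins, and table() counts the dogs per bin, which is what a histogram would draw as bar heights.

```r
# Hypothetical ages in years; made up for illustration.
ages <- c(0.5, 2, 4, 5, 7, 8, 8.5, 11)

# Bin edges at 0, 3, 6, 9, and 12 give four bins of equal width.
# include.lowest = TRUE makes sure an age of exactly 0 falls in the first bin.
age_bins <- cut(ages, breaks = seq(0, 12, by = 3), include.lowest = TRUE)

# Number of dogs per bin.
bin_counts <- table(age_bins)
print(bin_counts)
```

A value that lies exactly on an edge (e.g., 3) can only belong to one bin. With these settings, it is counted in the lower bin (0-3).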
Story: You are deeply committed to the well-being of
your dogs and ensure they receive regular daily exercise. You believe
that by showcasing this practice to a broader audience, you can achieve
two goals: (1) inspire other shelters to increase the frequency of
exercise for their dogs, and (2) attract families who prioritize the
well-being of their pets. Consequently, you have decided to plot the
extent of your dogs’ exercise on a histogram and display it on your
website.
# Plot the exercise time distribution using a histogram
ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Two parameters are important in histograms: the number of bins
(bins) and the width of each bin (binwidth). bins
controls the number of bins into which the numeric variable is divided
(i.e., the number of bars in the plot). The default is 30, but it is
helpful to try smaller and larger numbers to get a better impression of
the shape of the distribution.
ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", bins = 10, color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs")

Alternatively, you can specify the binwidth, the width of the bins
represented by the bars.
ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       subtitle = "bin width = 10 minutes",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs")

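To check what binwidth actually does, we can look at the bins that ggplot2 computes. The sketch below uses made-up exercise times (an assumption for illustration). With binwidth = 10, every bin spans 10 minutes, and the bin counts add up to the number of dogs.

```r
library(ggplot2)

# Hypothetical stand-in for dog_df; the exercise times are made up for illustration.
dog_df <- data.frame(
  daily_exercise_time_min = c(15, 22, 30, 35, 41, 48, 55, 63, 78, 90, 104, 120)
)

p <- ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black")

# layer_data() returns one row per bin, including its edges (xmin, xmax)
# and the number of observations in it (count).
bins <- layer_data(p)

bin_widths <- bins$xmax - bins$xmin
total_count <- sum(bins$count)

print(unique(round(bin_widths, 6)))
print(total_count)
```

Setting bins fixes how many bars we get, while binwidth fixes how wide each bar is. We normally set one or the other, not both.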
The histogram shows a somewhat irregular distribution of daily
exercise times (not a normal distribution which would be symmetrical,
and not a skewed one which would have a tail on one side.)
The most common exercise time seems to fall between 25 and 50 minutes,
as this range has the highest count of dogs.
There is considerable variability in exercise times as the times
range from less than 25 minutes to over 100 minutes. The bins
representing the shortest (<25 minutes) and longest (>100 minutes)
exercise times are less populated.
Before we move on, another thing you can change is the scale shown on
the x-axis and y-axis using scales. Scales control how
variables are mapped to the visual characteristics of the plot. Scale
functions (which start with scale_) allow you to modify this
mapping.
ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black") +
  labs(title = "Histogram of Daily Exercise Time",
       subtitle = "bin width = 10 minutes",
       x = "Daily Exercise Time (minutes)",
       y = "Count of Dogs") +
  scale_x_continuous(breaks = seq(0, 130, 10))

# scale_x_continuous(breaks = seq(0,130,10)) tells the graph that the values on the x-axis are continuous. They are a sequence of numbers from 0 to 130 and we have breaks every 10 units. This means you will see marks on the x-axis at 0, 10, 20, 30, 40, 50, 60, etc.
## Can you change the scale for the y axis?
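One possible answer, as a sketch: the y-axis has its own scale function, scale_y_continuous(), which works the same way. The data below are made up for illustration (an assumption), and the break values (0 to 10 in steps of 1) are only an example that you would adapt to your own counts.

```r
library(ggplot2)

# Hypothetical stand-in for dog_df; the exercise times are made up for illustration.
dog_df <- data.frame(
  daily_exercise_time_min = c(15, 22, 30, 35, 41, 48, 55, 63, 78, 90, 104, 120)
)

p <- ggplot(dog_df, aes(x = daily_exercise_time_min)) +
  geom_histogram(fill = "blue", binwidth = 10, color = "black") +
  scale_x_continuous(breaks = seq(0, 130, 10)) +
  # Tick marks on the y-axis at every whole number from 0 to 10.
  scale_y_continuous(breaks = seq(0, 10, 1))

# Look up the breaks stored in the y scale of the plot.
y_breaks <- p$scales$get_scales("y")$breaks
print(y_breaks)
```

Whole-number breaks are a sensible choice here because the y-axis shows counts of dogs.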