In the last tutorial, we learned about the essential elements of a ggplot2 graphic. These are:

– **Data element**

– **Aesthetics element**

– **Geometries element**

These three elements are the essential building blocks of any ggplot2 graphic.

In this tutorial, we will explain the remaining four elements:

– **Facets**

– **Statistics**

– **Coordinates**

– **Themes**

These four elements are not essential to creating a ggplot2 graphic, but they are a great help when you want to tell a story with a data set. Again, we will be using the Pokemon data set from Kaggle to illustrate the concepts.

```r
library(tidyverse)

poke <- read.csv("Pokemon.csv")

# convert any factor columns to character
poke <- poke %>%
  dplyr::mutate_if(., is.factor, as.character)

# keep only grass, water and fire Pokemon
# (str_detect() takes the string first, then the pattern)
poke_mod <- poke %>%
  dplyr::filter(., stringr::str_detect(Type.1, "^(Grass|Water|Fire)$"))

str(poke_mod)
# $ X.        : int 1 2 3 3 4 5 6 6 6 7 ...
# $ Name      : chr "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
# $ Type.1    : chr "Grass" "Grass" "Grass" "Grass" ...
# $ Type.2    : chr "Poison" "Poison" "Poison" "Poison" ...
# $ Total     : int 318 405 525 625 309 405 534 634 634 314 ...
# $ HP        : int 45 60 80 80 39 58 78 78 78 44 ...
# $ Attack    : int 49 62 82 100 52 64 84 130 104 48 ...
# $ Defense   : int 49 63 83 123 43 58 78 111 78 65 ...
# $ Sp..Atk   : int 65 80 100 122 60 80 109 130 159 50 ...
# $ Sp..Def   : int 65 80 100 120 50 65 85 85 115 64 ...
# $ Speed     : int 45 60 80 80 65 80 100 100 100 43 ...
# $ Generation: int 1 1 1 1 1 1 1 1 1 1 ...
# $ Legendary : chr "False" "False" "False" "False" ...
```

# Faceting

We already discussed the aesthetics element and the color attribute last time when we compared different groups. We produced a graph such as this one:

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point()
```

However, there is an alternative approach when comparing groups. Faceting takes subgroups of the data and displays them in separate plots.

For the plot above, we can see that each Pokemon type is colored differently. A lot of these points are overlapping and it is hard to see differences between groups. Faceting tries to make differences more visible when we have a lot of differently colored points that are overlapping. Let’s try it out.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point() +
  facet_grid(cols = vars(Type.1))
```

In the above graph, we used the function `facet_grid()` to display three separate plots. Each subgroup is plotted on its own. The argument `cols = vars(Type.1)` means that we want a single row with multiple columns. Because the panels share one y-axis, this is a great way to compare y positions.

On the other hand, if we had specified `rows = vars(Type.1)`, it would have created three rows with one column. This is particularly useful when one wants to compare x positions. It is also especially useful for comparing distributions, as shown below:

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_grid(rows = vars(Type.1))
```

In the case above, we could compare distributions very easily because there are roughly the same number of fire, grass, and water Pokemon in the data set. However, what happens when there is an imbalance in the data set?

Let us examine what happens when there are six times as many water Pokemon as there were originally. The code below appends five extra copies of the water Pokemon and then shows the same plot as before.

```r
# build five extra copies of the water Pokemon and append them to poke_mod
poke_mod %>%
  base::split(., .$Type.1) %>%
  purrr::map(~ dplyr::filter(., Type.1 == "Water")) %>%
  base::Filter(function(x) dim(x)[1] > 1, .) %>%  # drop the empty fire and grass pieces
  base::rep(., 5) %>%
  base::do.call(rbind, .) %>%
  base::rbind(., poke_mod) -> six_times_water

ggplot(data = six_times_water, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_grid(rows = vars(Type.1))
```

As you can see, this graph makes it hard to identify the counts for fire and grass Pokemon. This is because the shared y-axis stretched to accommodate the larger number of water Pokemon. How do we fix that? The `facet_grid()` function has an argument, `scales = "free"`, that lets the axis ranges vary between panels instead of being shared. The graph would look like this:

```r
ggplot(data = six_times_water, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_grid(rows = vars(Type.1), scales = "free")
```

Now we can actually see that the mode of fire Pokemon is around 80 attacking points and the mode of grass Pokemon is around 50 attacking points.

Another useful tool is the `facet_wrap()` function, which comes in handy when faceting by multiple variables. Let’s facet by type and by whether a Pokemon is legendary or not.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_wrap(vars(Type.1, Legendary))
```

This looks a bit messy because we can’t really compare legendary Pokemon for different types very well. Let’s try another layout of the panels in the grid.

With the argument `ncol = 2` in the `facet_wrap()` function, you can specify how many columns you want. The previous plot used three columns by default. Now, we want to have only two columns.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_wrap(vars(Type.1, Legendary), ncol = 2, scales = "free_y")
```

This looks much better, and we can compare legendary Pokemon across types more easily. Notice that we also let the y-axis vary between panels with the argument `scales = "free_y"`.

It looks like most legendary Pokemon are of type water. Legendary grass Pokemon have the lowest attacking points on average, and grass is also the type with the fewest legendary Pokemon.
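The difference between `facet_wrap()` and `facet_grid()` when faceting by two variables can be seen in a small sketch. The `toy` data frame below is made up for illustration (it is not the Pokemon data) and deliberately contains no legendary grass rows. As far as I understand it, `facet_wrap()` only draws panels for combinations that occur in the data, while `facet_grid()` lays out the full rows-by-columns grid and leaves the missing combination as an empty panel.

```r
library(ggplot2)

# Hypothetical toy data: there is no row with type "Grass" and legendary "True"
toy <- data.frame(
  type      = c("Fire", "Fire", "Grass", "Grass", "Water", "Water", "Water", "Fire"),
  legendary = c("False", "True", "False", "False", "False", "True", "True", "False"),
  attack    = c(52, 100, 49, 62, 48, 110, 150, 84)
)

# facet_wrap(): one panel per combination that exists in the data
p_wrap <- ggplot(toy, aes(x = attack, fill = type)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(type, legendary), ncol = 2)

# facet_grid(): full grid, types in columns, legendary status in rows
p_grid <- ggplot(toy, aes(x = attack, fill = type)) +
  geom_histogram(binwidth = 20) +
  facet_grid(rows = vars(legendary), cols = vars(type))

# Number of panels each layout is expected to produce, computed from the data
n_wrap_panels <- nrow(unique(toy[c("type", "legendary")]))
n_grid_panels <- length(unique(toy$type)) * length(unique(toy$legendary))
```

Printing `p_wrap` and `p_grid` side by side shows the trade-off: the grid keeps rows and columns aligned, which makes comparisons across types easier, while the wrapped layout uses the space more economically.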

Now that we discussed the most important points about faceting, we can move on to the statistics element.

# Statistics

Believe it or not, you have already used a lot of stats elements when creating ggplots, for example when creating a scatter plot with a smooth curve.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, col = Type.1)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 0.7)
```

In the above graph, `geom_smooth()` used `stat_smooth()` behind the scenes to create the smooth curves. With the `span = 0.7` argument, you can control the “wiggliness” of the curves: smaller values follow the data more closely, larger values give smoother curves.

Another example where we have used statistical transformations to create a plot would be a histogram. For example:

```r
ggplot(data = poke_mod, mapping = aes(x = Attack)) +
  geom_histogram()
```

In the histogram above, `geom_histogram()` uses `stat_bin()` to sort the data into default bins and count the number of observations in each bin. What `stat_bin()` does:

It takes the data frame and creates these variables based on it:

– `count`, the number of observations in each bin

– `density`, the density of observations in each bin

– `x`, the center of the bin
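The computed variables listed above can be mapped to an aesthetic with `after_stat()` (available in ggplot2 3.3.0 and later, as far as I know). The sketch below uses a made-up `toy` data frame, not the Pokemon data, and plots `density` instead of the default `count`, which is one way to compare groups of very different sizes.

```r
library(ggplot2)

# Hypothetical toy data: many more "Water" rows than "Fire" rows
toy <- data.frame(
  type   = rep(c("Fire", "Water"), times = c(10, 60)),
  attack = c(seq(40, 130, by = 10), rep(seq(30, 140, by = 10), times = 5))
)

# Map the computed `density` variable to y instead of the default `count`
p_density <- ggplot(toy, aes(x = attack, y = after_stat(density), fill = type)) +
  geom_histogram(binwidth = 10) +
  facet_grid(rows = vars(type))

# layer_data() returns the data frame that stat_bin() computed for layer 1
binned <- layer_data(p_density, 1)
head(binned[, c("PANEL", "x", "count", "density")])

# Within each panel, bar heights times bar widths should add up to 1
area_per_panel <- tapply(binned$density * (binned$xmax - binned$xmin),
                         binned$PANEL, sum)
```

Because each panel integrates to one, the shapes of the two distributions can be compared even though one group is six times larger than the other.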

```r
ggplot(data = poke_mod, mapping = aes(x = Attack)) +
  geom_bar(stat = "bin")
```

As you can see, `geom_histogram()` and `geom_bar(stat = "bin")` produce exactly the same result. The only difference is that `geom_histogram()` uses `stat_bin()` under the hood by default, whereas `geom_bar()` has to be told to use it.
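If you want to convince yourself of this equivalence, a quick check along the following lines should do. It is only a sketch on a made-up `toy` vector of attack values, and it assumes that `layer_data()` exposes the `count` column computed by the stat.

```r
library(ggplot2)

# Hypothetical toy data, all values distinct
toy <- data.frame(attack = c(45, 49, 52, 60, 62, 64, 80, 82, 84, 100, 104, 130))

p_hist <- ggplot(toy, aes(x = attack)) + geom_histogram(bins = 10)
p_bar  <- ggplot(toy, aes(x = attack)) + geom_bar(stat = "bin", bins = 10)

# Both layers are binned by stat_bin(), so the computed counts should match
same_counts <- isTRUE(all.equal(layer_data(p_hist)$count,
                                layer_data(p_bar)$count))

# With its default stat ("count"), geom_bar() tallies each distinct x value instead
p_count <- ggplot(toy, aes(x = attack)) + geom_bar()
n_bars  <- nrow(layer_data(p_count))
```

The last two lines show why the `stat` argument matters: without it, `geom_bar()` draws one bar per distinct attack value rather than one bar per bin.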

Enough with the boring stuff now. It is like knowing how a linear regression algorithm works in R versus just applying the `lm()` function. Most importantly, you have to know how to use it and interpret it; later on, you might become interested in how the details work.

Other useful `stat` functions include `stat_function()`, which lets you plot functions and distributions. Below, we plot a t-distribution with 20 degrees of freedom and color in red the critical region where the null hypothesis would be rejected.

Check out my blog post about p-values. The plots I created were made in base R and are explaining visually the concept of p-values.

```r
# create x-values for distribution
x <- data.frame(x = seq(-4, 4, length.out = 1000))

# plotting normal distribution with ggplot
ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dnorm)

# plotting t-distribution with ggplot
t_distribution <- ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dt, args = list(df = 20))

# get y-values for corresponding x-values for t-distribution
data <- data.frame(x = x$x, y = dt(x$x, df = 20))

# get two-tailed critical value for t-distribution
# with 20 degrees of freedom and alpha = 0.05
critical_value <- qt(0.025, 20)

t_distribution +
  geom_area(data = subset(data, x <= critical_value),
            mapping = aes(x = x, y = y), fill = "red") +
  geom_area(data = subset(data, x >= abs(critical_value)),
            mapping = aes(x = x, y = y), fill = "red") +
  geom_vline(xintercept = c(critical_value, abs(critical_value))) +
  annotate("text", x = -3.5, y = 0.375, label = "alpha = 0.05") +
  annotate("text", x = -3.25, y = 0.05,
           label = paste("critical value = ", round(critical_value, digits = 3)),
           col = "red") +
  annotate("text", x = 3.25, y = 0.05,
           label = paste("critical value = ", round(abs(critical_value), digits = 3)),
           col = "red")
```

With this kind of plot you can get an intuitive understanding of p-values and probabilities. However, for our Pokemon data set this kind of plot is not very helpful.
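As a more compact variation, the shaded tails can also be drawn by `stat_function()` itself. The following is a sketch rather than the approach used above: it relies on the `geom = "area"` and `xlim` arguments of `stat_function()`, and the names `df_t`, `alpha`, `crit` and `tail_prob` are my own. The last line checks numerically that the two tails beyond the critical values carry a total probability of 0.05.

```r
library(ggplot2)

df_t  <- 20     # degrees of freedom
alpha <- 0.05   # two-tailed significance level

# Lower critical value; the upper one is its mirror image
crit <- qt(alpha / 2, df = df_t)

p_tails <- ggplot() +
  # full density curve
  stat_function(fun = dt, args = list(df = df_t), xlim = c(-4, 4)) +
  # shaded rejection regions, each evaluated only over its own x range
  stat_function(fun = dt, args = list(df = df_t), geom = "area",
                xlim = c(-4, crit), fill = "red") +
  stat_function(fun = dt, args = list(df = df_t), geom = "area",
                xlim = c(-crit, 4), fill = "red") +
  geom_vline(xintercept = c(crit, -crit))

# Probability mass in both tails beyond the critical values
tail_prob <- pt(crit, df = df_t) + pt(-crit, df = df_t, lower.tail = FALSE)
```

Note that the plot only extends from -4 to 4, so the red areas are cut off slightly before the tails actually end.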

Let’s check out another `stat` function that is more informative when working with data sets. With the `stat_summary()` function, we can summarise y values at distinct x values. For example:

```r
ggplot(data = poke_mod, mapping = aes(x = Type.1, y = Speed)) +
  geom_jitter(width = 0.1) +
  # `fun` replaces the `fun.y` argument, which is deprecated since ggplot2 3.3.0
  stat_summary(geom = "point", fun = "median", size = 5, aes(colour = Type.1))
```

The `stat_summary()` function calculates the median speed of each Pokemon type. It looks like fire Pokemon are typically faster than grass or water Pokemon. This means that they will tend to attack first.

In the graph above, I used `geom_jitter()` instead of `geom_point()`. `geom_jitter()` spreads the points apart so we can get a better understanding of where the density of points is highest.
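`stat_summary()` can also show a range around the summary value. The sketch below uses a made-up `toy` data frame in place of the Pokemon speeds and the `fun`, `fun.min` and `fun.max` arguments (ggplot2 3.3.0 or later) to draw the median as a point with a line from the minimum to the maximum. The last lines cross-check the plotted medians against a plain base R calculation.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Pokemon speeds
toy <- data.frame(
  type  = rep(c("Fire", "Grass", "Water"), each = 5),
  speed = c(65, 80, 100, 90, 60,
            45, 60, 80, 30, 55,
            43, 58, 78, 70, 85)
)

p_summary <- ggplot(toy, aes(x = type, y = speed)) +
  geom_jitter(width = 0.1) +
  # median as the point, minimum and maximum as the ends of the line
  stat_summary(fun = median, fun.min = min, fun.max = max,
               geom = "pointrange", colour = "red")

# Medians as computed by the second layer versus base R
plotted_medians <- layer_data(p_summary, 2)$y
base_medians    <- as.numeric(tapply(toy$speed, toy$type, median))
```

Any function that returns a single number can be passed to `fun`, so swapping `median` for `mean` is a one-word change.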

# Coordinates

Coordinates are all about displaying your data in the right way. They are super useful and contribute a lot to the overall information that a plot conveys.

```r
# order count of type
highest_to_lowest <- poke %>%
  dplyr::group_by(., Type.1) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(., n) %>%
  dplyr::pull(., Type.1) %>%
  base::unique(.)

poke$Type.1 <- factor(poke$Type.1, levels = highest_to_lowest)

ggplot(data = poke, mapping = aes(x = Type.1)) +
  geom_bar()
```

This does not look bad. However, we have a lot of Pokemon types on our x-axis and some of the labels are overlapping. A better solution would be to flip the x and y axes.

```r
ggplot(data = poke, mapping = aes(x = Type.1)) +
  geom_bar() +
  coord_flip()
```

Now, we can read the Pokemon types much more easily. When talking about coordinates, we can also talk about scales such as `scale_x_continuous()` and `scale_y_continuous()`. These might be useful when you are interested in a specific part of a plot. Let’s consider the histogram below. We want to examine the highest attacking values a little more closely.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack)) +
  geom_histogram()
```

```r
ggplot(data = poke_mod, mapping = aes(x = Attack)) +
  geom_histogram() +
  scale_x_continuous(limits = c(130, 180), breaks = seq(130, 180, 5))
```

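Now, the graph only shows attacking values between 130 and 180. Note that setting `limits` on a scale removes the observations outside that range before the histogram is computed, which is why ggplot2 prints a warning about removed rows. `breaks = seq(130, 180, 5)` means that we want a tick every 5 units from 130 to 180, so there should be ticks at 130, 135, 140, …, 180.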
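There is a second way to look at a specific part of a plot: `coord_cartesian()`. As far as I understand the difference, scale limits drop the observations outside the range before any statistics are computed, whereas `coord_cartesian()` computes everything on the full data and only zooms the view afterwards. The sketch below illustrates this with made-up attack values; `toy`, `n_binned_scale` and `n_binned_zoom` are names I chose for this example.

```r
library(ggplot2)

# Hypothetical attack values, one every 5 points from 10 to 180
toy <- data.frame(attack = seq(10, 180, by = 5))

# Scale limits: rows outside 130..180 are removed before binning (with a warning)
p_scale <- ggplot(toy, aes(x = attack)) +
  geom_histogram(binwidth = 10) +
  scale_x_continuous(limits = c(130, 180), breaks = seq(130, 180, 5))

# coord_cartesian(): all rows are binned, then the view is zoomed
p_zoom <- ggplot(toy, aes(x = attack)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(130, 180))

# Total number of observations that ended up in the bins of each plot
n_binned_scale <- sum(suppressWarnings(layer_data(p_scale))$count)
n_binned_zoom  <- sum(layer_data(p_zoom)$count)
```

For a histogram the choice mostly affects where the bin edges fall, but for summaries such as smoothers or boxplots the two approaches can produce visibly different results, so it is worth knowing which one you are using.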

Other useful transformations are log transformations. These are pretty useful when the data points are very cluttered. Look, for example, at the graph below from the gapminder data set. Most GDP values are close to zero, so a log transformation would help to make more sense of the graphic.

```r
library(gapminder)

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ log(x))
```

In order to have a better look at the data, we can log transform the x-axis with `scale_x_log10()` and we get a better picture of the relationship between life expectancy and GDP per capita.

```r
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10()
```
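It may help to see what the straight line in the log-scaled plot actually represents. Scale transformations are applied before the statistics are computed, so the `lm` smoother is fitted on log10 of the x values. The sketch below checks this on a small made-up `toy` data frame (not the gapminder data) by comparing the smoother's values with a model fitted by hand.

```r
library(ggplot2)

# Hypothetical toy data with a roughly log-linear relationship
toy <- data.frame(gdp = c(300, 500, 1000, 2000, 5000, 10000, 20000, 40000))
toy$life_exp <- c(42, 47, 52, 58, 65, 70, 75, 79)

p_log <- ggplot(toy, aes(x = gdp, y = life_exp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10()

# The same model fitted by hand on the transformed predictor
fit <- lm(life_exp ~ log10(gdp), data = toy)

# The smooth layer stores x on the log10 scale, so back-transform before predicting
smooth_data <- layer_data(p_log, 2)
by_hand     <- unname(predict(fit, newdata = data.frame(gdp = 10^smooth_data$x)))
```

If the two agree, the line drawn by `geom_smooth()` is simply the regression of life expectancy on log10 of GDP.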

There is not much more to say about coordinates. Many more things are possible but we discussed the most important and most used ones.

Next, we are talking about themes. In this context, we are also talking about how to improve legends and axis labels. Themes are all about how a graph looks.

# Themes

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point() +
  ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") +
  theme_minimal()
```

We notice that the background changed with the new theme choice. We also added a title to the graphic with `ggtitle()`. However, we notice that the title of the ggplot is not centered.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point() +
  ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") +
  xlab("Base Attacking Skills") +
  ylab("Base Defending Skills") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
```

Now, the title is centered and we changed the x-labels and y-labels to be more informative.

Now let’s make the legend and the overall plot a bit nicer.

```r
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point() +
  ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") +
  xlab("Base Attacking Skills") +
  ylab("Base Defending Skills") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        legend.position = "bottom",
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 12),
        axis.title.x = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.y = element_text(size = 15)) +
  scale_color_discrete(name = "Pokemon Type",
                       labels = c("Fire Pokemon",
                                  "Grass Pokemon",
                                  "Water Pokemon"))
```

This plot is much better.

Notice that we made the axis titles and axis tick labels bigger with:

– `axis.title.x = element_text(size = 15)`. Enlarging the x-axis title.

– `axis.text.x = element_text(size = 15)`. Enlarging the x-axis tick labels.

– `axis.title.y = element_text(size = 15)`. Enlarging the y-axis title.

– `axis.text.y = element_text(size = 15)`. Enlarging the y-axis tick labels.

This is really helpful for presentations so that the audience can clearly see what is going on. Without being able to read the axis labels properly, a plot loses its meaning.

We moved the legend to the bottom with:

– `legend.position = "bottom"`

We changed the legend title and labels with:

– `scale_color_discrete()`

We added a title and centered it with:

– `ggtitle()`

– `plot.title = element_text(hjust = 0.5, size = 20)`
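If you reuse the same settings across several plots, it can be convenient to bundle them into a small helper. The function `theme_presentation()` below is a hypothetical name of my own, not part of ggplot2, and the `toy` data frame is made up. The sketch also uses `labs()`, which sets the title, the axis labels and the legend title in a single call.

```r
library(ggplot2)

# Hypothetical helper that bundles presentation-friendly theme settings
theme_presentation <- function(base_size = 15) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title      = element_text(hjust = 0.5, size = base_size + 5),
      legend.position = "bottom"
    )
}

# Hypothetical toy data standing in for the Pokemon data
toy <- data.frame(
  attack  = c(49, 62, 82, 52, 64, 84, 48, 63, 83),
  defense = c(49, 63, 83, 43, 58, 78, 65, 80, 100),
  type    = rep(c("Grass", "Fire", "Water"), each = 3)
)

p_themed <- ggplot(toy, aes(x = attack, y = defense, colour = type)) +
  geom_point() +
  # labs() sets title, axis labels and legend title in one call
  labs(
    title  = "Defending/Attacking Scatter Plot By Pokemon Type",
    x      = "Base Attacking Skills",
    y      = "Base Defending Skills",
    colour = "Pokemon Type"
  ) +
  theme_presentation()
```

Setting `base_size` once scales all text elements together, which saves listing every axis title and tick label individually.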

There are many more things that I haven’t covered in the two ggplot tutorials. However, the basics are most important. Now you have the ability to go and google solutions for your own graphics and are able to understand a solution’s code.

This was the second part of the grammar of graphics ggplot2 tutorial. Now you know about all the ggplot layers. If you have any questions or suggestions, let me know in the comments below.