The Grammar of Graphics – Pokemon ggplot Tutorial Part2
February 21, 2019 By Pascal Schmidt R Tidyverse Tutorial
Last tutorial, we learned the essential elements of a ggplot2 graphic. That is:
- Data element
- Aesthetics element
- Geometries element
These three elements are the essential blocks for a successful ggplot2 graphic.
In this tutorial, we will explain the remaining four elements:
- Facets
- Statistics
- Coordinates
- Themes
These four elements are not essential to creating a ggplot2 graphic but are of great help when wanting to tell a story with a data set. Again, we will be using the Pokemon data set from Kaggle to illustrate concepts.
library(tidyverse)

poke <- read.csv("Pokemon.csv")

# Convert factor columns to character
# (only needed when read.csv() returns factors, i.e. R < 4.0)
poke <- poke %>%
  dplyr::mutate_if(., is.factor, as.character)

# Keep only grass, water, and fire Pokemon
# (str_detect() takes the string first and the pattern second)
poke_mod <- poke %>%
  dplyr::filter(., stringr::str_detect(Type.1, "^(Grass|Water|Fire)$"))

str(poke_mod)
# $ X.        : int 1 2 3 3 4 5 6 6 6 7 ...
# $ Name      : chr "Bulbasaur" "Ivysaur" "Venusaur" "VenusaurMega Venusaur" ...
# $ Type.1    : chr "Grass" "Grass" "Grass" "Grass" ...
# $ Type.2    : chr "Poison" "Poison" "Poison" "Poison" ...
# $ Total     : int 318 405 525 625 309 405 534 634 634 314 ...
# $ HP        : int 45 60 80 80 39 58 78 78 78 44 ...
# $ Attack    : int 49 62 82 100 52 64 84 130 104 48 ...
# $ Defense   : int 49 63 83 123 43 58 78 111 78 65 ...
# $ Sp..Atk   : int 65 80 100 122 60 80 109 130 159 50 ...
# $ Sp..Def   : int 65 80 100 120 50 65 85 85 115 64 ...
# $ Speed     : int 45 60 80 80 65 80 100 100 100 43 ...
# $ Generation: int 1 1 1 1 1 1 1 1 1 1 ...
# $ Legendary : chr "False" "False" "False" "False" ...
Faceting
We already discussed the aesthetics element and the color attribute last time when we compared different groups. We produced a graph such as this one:
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) + geom_point()
However, there is an alternative approach when comparing groups. Faceting takes subgroups of the data and displays them in separate plots.
For the plot above, we can see that each Pokemon type is colored differently. A lot of these points are overlapping and it is hard to see differences between groups. Faceting tries to make differences more visible when we have a lot of differently colored points that are overlapping. Let’s try it out.
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) + geom_point() + facet_grid(cols = vars(Type.1))
In the above graph, we used the function facet_grid() to display three separate plots. Each subgroup is plotted on its own. The argument cols = vars(Type.1) means that we want a single row with multiple columns. This is a great way to compare y positions.
On the other hand, if we had specified rows = vars(Type.1), it would have created three rows with one column. This is particularly useful when one wants to compare x positions. It is also especially useful for comparing distributions, as below:
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) + geom_histogram() + facet_grid(rows = vars(Type.1))
In the case above, we could compare distributions very easily because there are roughly the same number of fire, grass, and water Pokemon in the data set. However, what happens when there is an imbalance in the data set?
Let us examine what happens when there are six times as many water Pokemon as in the original data. The code below stacks five extra copies of the water Pokemon onto the data set and then draws the same plot as before.
# Stack five extra copies of the water Pokemon on top of the original data
poke_mod %>%
  base::split(., .$Type.1) %>%
  purrr::map(~ dplyr::filter(., Type.1 == "Water")) %>%
  base::Filter(function(x) dim(x)[1] > 1, .) %>%
  base::rep(., 5) %>%
  base::do.call(rbind, .) %>%
  base::rbind(., poke_mod) -> six_times_water

ggplot(data = six_times_water, mapping = aes(x = Attack, fill = Type.1)) +
  geom_histogram() +
  facet_grid(rows = vars(Type.1))
As you can see, this graph makes it hard to identify the counts for fire and grass Pokemon. This is because the y-axis, which is shared by all panels, adjusted to the increase in water Pokemon. How do we fix that? The facet_grid() function has an argument for this: scales = "free". The graph would look like this:
ggplot(data = six_times_water, mapping = aes(x = Attack, fill = Type.1)) + geom_histogram() + facet_grid(rows = vars(Type.1), scales = "free")
Now we can actually see that the mode of fire Pokemon is around 80 attacking points and the mode of grass Pokemon is around 50 attacking points.
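To see what freeing the scales changes under the hood, here is a minimal sketch. It uses a small made-up data frame (the name toy and its values are assumptions, not the Pokemon data) and looks at the layout table that ggplot2 builds for the panels. It uses scales = "free_y", which frees only the y-axis and keeps the x-axis aligned across panels.

```r
library(ggplot2)

# Toy data (hypothetical): one frequent group and two rare ones
toy <- data.frame(
  type   = rep(c("Water", "Fire", "Grass"), times = c(60, 10, 10)),
  attack = c(seq(20, 138, by = 2), seq(40, 130, by = 10), seq(30, 120, by = 10))
)

base_plot <- ggplot(toy, aes(x = attack, fill = type)) +
  geom_histogram(binwidth = 10)

fixed_plot <- base_plot + facet_grid(rows = vars(type))
free_plot  <- base_plot + facet_grid(rows = vars(type), scales = "free_y")

# The layout table records which y-scale each panel uses
fixed_layout <- ggplot_build(fixed_plot)$layout$layout
free_layout  <- ggplot_build(free_plot)$layout$layout

length(unique(fixed_layout$SCALE_Y))  # number of y-scales with fixed scales
length(unique(free_layout$SCALE_Y))   # number of y-scales with free y-scales
```

With fixed scales all panels should share a single y-scale, while with free y-scales each row gets its own, which is why the rare groups become readable.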
Another useful tool is the facet_wrap() function, which is handy when faceting by multiple variables. Let’s facet by type and by whether a Pokemon is legendary or not.
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) + geom_histogram() + facet_wrap(vars(Type.1, Legendary))
This looks a bit messy because we can’t really compare legendary Pokemon across types very well. Let’s try another layout of the panels in the grid.
With the argument ncol = 2 in the facet_wrap() function, you can specify how many columns you want. The previous plot used three columns by default. Now, we want to have only two columns.
ggplot(data = poke_mod, mapping = aes(x = Attack, fill = Type.1)) + geom_histogram() + facet_wrap(vars(Type.1, Legendary), ncol = 2, scales = "free_y")
This looks much better, and it is now much easier to compare legendary Pokemon across types. Notice that we also let the y-axis vary between panels with the argument scales = "free_y".
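The effect of ncol on the panel grid can be checked directly. The following is a minimal sketch with made-up data (toy and its values are assumptions): three types times two legendary flags give six panels, and ncol = 2 should arrange them in three rows of two.

```r
library(ggplot2)

# Toy data (hypothetical): 3 types x 2 legendary flags = 6 panels
toy <- expand.grid(
  type      = c("Fire", "Grass", "Water"),
  legendary = c("False", "True"),
  attack    = c(40, 60, 80, 100),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = attack, fill = type)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(type, legendary), ncol = 2, scales = "free_y")

# One row per panel, with its grid position and facet values
layout <- ggplot_build(p)$layout$layout
layout[, c("PANEL", "ROW", "COL", "type", "legendary")]
```

Printing the layout table is a quick way to confirm where each combination of type and legendary status ends up before fine-tuning the plot.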
It looks like most legendary Pokemon are of type water. Grass has the fewest legendary Pokemon, and they also have the lowest attack values on average.
Now that we discussed the most important points about faceting, we can move on to the statistics element.
Statistics
Believe it or not but you already used a lot of stats elements when creating ggplots. You have used these stats when for example creating a scatter plot with a smooth curve.
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, col = Type.1)) + geom_point() + geom_smooth(se = FALSE, span = 0.7)
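In the above graph, geom_smooth() used stat_smooth() behind the scenes to create the smooth curves. With the span = 0.7 argument, you can control the “wiggliness” of the curves: a smaller span gives a wigglier curve. The span only has an effect with the loess smoother, which ggplot2 picks by default for data sets with fewer than 1,000 observations.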
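The effect of span can also be seen without any plotting. The sketch below uses base R's loess() on simulated data (the sine curve, the noise level, and the two span values are arbitrary choices for illustration) and compares how closely each fit follows the points.

```r
# Base R only: loess() is the smoother behind geom_smooth() for small data sets
set.seed(42)
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

fit_wide   <- loess(y ~ x, span = 0.9)  # each local fit uses 90% of the points
fit_narrow <- loess(y ~ x, span = 0.2)  # each local fit uses 20% of the points

# Residual sum of squares: lower means the curve follows the points more closely
rss <- function(fit) sum(residuals(fit)^2)
rss(fit_wide)
rss(fit_narrow)
```

The narrow span should give the smaller residual sum of squares because it bends with the data, whereas the wide span smooths over the waves. A span that is too small starts to chase noise, so the right value depends on the data.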
Another example where we have used statistical transformations to create a plot would be a histogram. For example:
ggplot(data = poke_mod, mapping = aes(x = Attack)) + geom_histogram()
In the histogram above, geom_histogram() uses stat_bin() to choose default bins and count the number of observations in each bin. stat_bin() takes the data frame and creates these variables from it:
- count, the number of observations in each bin
- density, the density of observations in each bin
- x, the center of the bin
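These computed variables can be inspected with layer_data(), which returns a layer's data after the stat has run. A minimal sketch, assuming a small made-up vector of attack values and an arbitrary bin width of 10:

```r
library(ggplot2)

# Toy data (hypothetical attack values)
toy <- data.frame(attack = c(10, 12, 15, 22, 25, 31, 33, 34, 48, 50))

p <- ggplot(toy, aes(x = attack)) +
  geom_histogram(binwidth = 10, boundary = 0)

# layer_data() returns the data frame after stat_bin() has run
bins <- layer_data(p)
bins[, c("x", "xmin", "xmax", "count", "density")]

# The counts add up to the number of rows, and the bar areas add up to 1
sum(bins$count)
sum(bins$density * (bins$xmax - bins$xmin))
```

This is the table that the geom actually draws: one row per bin instead of one row per Pokemon.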
ggplot(data = poke_mod, mapping = aes(x = Attack)) + geom_bar(stat = "bin")
As you can see, geom_histogram() and geom_bar(stat = "bin") produce exactly the same result. The only difference is that geom_histogram() uses stat_bin() by default, whereas geom_bar() has to be told to use it.
Enough with the boring stuff now. It is like knowing how a linear regression algorithm works in R versus just applying the lm() function. Most importantly, you have to know how to use it and interpret it. Later on, you might become interested in how the details work.
Another useful stat function is stat_function(), which lets you plot functions and distributions. Below, we plot a t-distribution with 20 degrees of freedom and color the critical region, where the null hypothesis would be rejected, in red.
Check out my blog post about p-values. The plots I created were made in base R and are explaining visually the concept of p-values.
# create x-values for distribution
x <- data.frame(x = seq(-4, 4, length = 1000))

# plotting normal distribution with ggplot
ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dnorm)

# plotting t-distribution with ggplot
t_distribution <- ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dt, args = list(df = 20))

# get y-values for corresponding x-values for t-distribution
data <- data.frame(x = x$x, y = dt(x$x, df = 20))

# get two-tailed critical value for t-distribution
# with 20 degrees of freedom and alpha = 0.05
critical_value <- qt(0.025, 20)

t_distribution +
  geom_area(data = subset(data, x <= critical_value),
            mapping = aes(x = x, y = y), fill = "red") +
  geom_area(data = subset(data, x >= abs(critical_value)),
            mapping = aes(x = x, y = y), fill = "red") +
  geom_vline(xintercept = c(critical_value, abs(critical_value))) +
  annotate("text", x = -3.5, y = 0.375, label = "alpha = 0.05") +
  annotate("text", x = -3.25, y = 0.05,
           label = paste("critical value = ", round(critical_value, digits = 3)),
           col = "red") +
  annotate("text", x = 3.25, y = 0.05,
           label = paste("critical value = ", round(abs(critical_value), digits = 3)),
           col = "red")
With this kind of plot, you can get an intuitive understanding of p-values and probabilities. However, for our Pokemon data set this kind of plot is not very helpful.
Let’s check out another stat function that is more informative when working with data sets. With the stat_summary() function, we can summarise y values at distinct x values. For example:
ggplot(data = poke_mod, mapping = aes(x = Type.1, y = Speed)) +
  geom_jitter(width = 0.1) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(geom = "point", fun = "median", size = 5, aes(colour = Type.1))
The stat_summary() function calculates the median speed of each Pokemon type. It looks like fire Pokemon have a higher median speed than grass or water Pokemon, which means that they will typically attack first.
In the graph above, I used geom_jitter instead of geom_point. geom_jitter spreads the points apart so we get a better idea of where the density of points is highest.
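To check that stat_summary() really computes the per-group median, the sketch below compares its output with a by-hand calculation. The data frame toy and its speed values are made up for illustration, and the fun argument assumes ggplot2 3.3.0 or newer.

```r
library(ggplot2)

# Toy data (hypothetical speeds, five Pokemon per type)
toy <- data.frame(
  type  = rep(c("Fire", "Grass", "Water"), each = 5),
  speed = c(60, 65, 80, 90, 100,  40, 45, 55, 60, 80,  43, 58, 65, 70, 85)
)

p <- ggplot(toy, aes(x = type, y = speed)) +
  stat_summary(geom = "point", fun = "median", size = 5)  # needs ggplot2 >= 3.3.0

# Medians computed by stat_summary(), one row per type
summary_layer <- layer_data(p)
summary_layer[, c("x", "y")]

# The same medians computed by hand
by_hand <- tapply(toy$speed, toy$type, median)
by_hand
```

Swapping "median" for "mean", or for any function that returns a single number, changes the summary without touching the rest of the plot.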
Coordinates
Coordinates are all about displaying your data in the right way. They are super useful and contribute a lot to the overall information that a plot conveys.
# order the Pokemon types by how often they occur (rarest first)
lowest_to_highest <- poke %>%
  dplyr::group_by(., Type.1) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(., n) %>%
  dplyr::pull(., Type.1) %>%
  base::unique(.)

poke$Type.1 <- factor(poke$Type.1, levels = lowest_to_highest)

ggplot(data = poke, mapping = aes(x = Type.1)) +
  geom_bar()
This does not look bad. However, we have a lot of Pokemon types on our x-axis and some labels are overlapping. A better solution is to flip the x and y axes.
ggplot(data = poke, mapping = aes(x = Type.1)) + geom_bar() + coord_flip()
Now, we can read the Pokemon types much more easily. When talking about coordinates, we can also talk about scales such as scale_x_continuous() and scale_y_continuous(). These are useful when you are interested in a specific part of a plot. Let’s consider the histogram below. We want to examine the highest attack values a little more closely.
ggplot(data = poke_mod, mapping = aes(x = Attack)) + geom_histogram()
ggplot(data = poke_mod, mapping = aes(x = Attack)) + geom_histogram() + scale_x_continuous(limits = c(130, 180), breaks = seq(130, 180, 5))
Now, the graph only shows attack values between 130 and 180. The argument breaks = seq(130, 180, 5) means that we want to go from 130 to 180 in steps of 5, so there should be a tick at 130, 135, 140, …, 180.
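One caveat worth knowing: limits set on a scale remove the observations outside the range before the bins are computed, whereas coord_cartesian() keeps all observations and only zooms the view. The sketch below illustrates the difference on made-up data (toy and the bin width are assumptions).

```r
library(ggplot2)

# Toy data (hypothetical attack values from 5 to 190)
toy <- data.frame(attack = seq(5, 190, by = 5))

base_plot <- ggplot(toy, aes(x = attack)) +
  geom_histogram(binwidth = 5)

# Scale limits: out-of-range values are dropped (ggplot2 warns about this)
limited <- base_plot + scale_x_continuous(limits = c(130, 180))

# Coordinate limits: all values are binned, then the view is zoomed
zoomed <- base_plot + coord_cartesian(xlim = c(130, 180))

# Number of observations that made it into the bins
n_limited <- sum(suppressWarnings(layer_data(limited))$count, na.rm = TRUE)
n_zoomed  <- sum(layer_data(zoomed)$count, na.rm = TRUE)
c(limited = n_limited, zoomed = n_zoomed, total = nrow(toy))
```

For a histogram the two approaches usually look similar, but for layers that fit something to the data, such as geom_smooth(), dropping observations changes the result, so coord_cartesian() is the safer choice when you only want to zoom.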
Log transformations of an axis are another useful tool. They help when most of the data is squeezed into a small part of the plot. Look, for example, at the graph below from the gapminder data set. Most GDP values are close to zero, so a log transformation helps to make more sense of the graphic.
library(gapminder) ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(method = "lm", formula = y ~ log(x))
In order to have a better look at the data, we can log transform the x-axis with scale_x_log10() and we get a better picture of the relationship between life expectancy and GDP per capita.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point() + geom_smooth(method = "lm") + scale_x_log10()
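Because the scale transformation is applied before any statistics are computed, geom_smooth(method = "lm") on a log10 x-axis fits a line to log10(x) rather than to x. A minimal sketch with made-up data, where y is constructed to be exactly 40 + 10 * log10(x):

```r
library(ggplot2)

# Toy data (hypothetical): y is exactly linear in log10(x)
toy <- data.frame(x = c(1, 10, 100, 1000, 10000))
toy$y <- 40 + 10 * log10(toy$x)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10()

# The smooth layer's x values are already on the log10 scale
smooth_line <- layer_data(p, 2)
range(smooth_line$x)

# Refitting a line through the smooth layer recovers the log10 relationship
line_coef <- coef(lm(y ~ x, data = smooth_line))
line_coef
```

The refitted intercept and slope should come out close to 40 and 10, which confirms that the straight line on the plot describes life-expectancy-style relationships in terms of log10(x), not x itself.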
There is not much more to say about coordinates. Many more things are possible but we discussed the most important and most used ones.
Next, we are talking about themes. In this context, we are also talking about how to improve legends and axis labels. Themes are all about how a graph looks.
Themes
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) + geom_point() + ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") + theme_minimal()
We notice that the background changed with the new theme choice. We also added a title to the graphic with ggtitle(). However, we notice that the title of the ggplot is not centered.
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) + geom_point() + ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") + xlab("Base Attacking Skills") + ylab("Base Defending Skills") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
Now, the title is centered and we changed the x-labels and y-labels to be more informative.
Now let’s make the legend and the overall plot a bit nicer.
ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
  geom_point() +
  ggtitle("Defending/Attacking Scatter Plot By Pokemon Type") +
  xlab("Base Attacking Skills") +
  ylab("Base Defending Skills") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        legend.position = "bottom",
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 12),
        axis.title.x = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.y = element_text(size = 15)) +
  scale_color_discrete(name = "Pokemon Type",
                       labels = c("Fire Pokemon", "Grass Pokemon", "Water Pokemon"))
This plot is much better.
Notice that we made the axis titles and the tick labels bigger with:
- axis.title.x = element_text(size = 15), which enlarges the x-axis title.
- axis.text.x = element_text(size = 15), which enlarges the x-axis tick labels.
- axis.title.y = element_text(size = 15), which enlarges the y-axis title.
- axis.text.y = element_text(size = 15), which enlarges the y-axis tick labels.
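The theme element names are easy to mix up: axis.title is the axis title, axis.text is the tick labels, and axis.ticks is the small tick marks themselves. The first two are text and take element_text(), while tick marks are lines and take element_line(). A minimal sketch on made-up data, using the default theme and an arbitrary red color for the tick marks:

```r
library(ggplot2)

# Toy data (hypothetical attack and defense values)
toy <- data.frame(attack = c(49, 62, 82), defense = c(49, 63, 83))

p <- ggplot(toy, aes(x = attack, y = defense)) +
  geom_point() +
  theme(
    axis.title.x = element_text(size = 15),     # axis title: text
    axis.text.x  = element_text(size = 15),     # tick labels: text
    axis.ticks.x = element_line(colour = "red") # tick marks: a line, not text
  )

p
```

Passing element_text() to axis.ticks.x would not enlarge the tick labels, so axis.text.x is the element to change when the numbers along the axis are too small.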
This is really helpful for presentations so that the audience can clearly see what is going on. Without being able to read the axis labels properly, a plot loses its meaning.
We moved the legend to the bottom with:
- legend.position = "bottom"
We changed the legend title and labels with:
- scale_color_discrete()
We added a title and centered it with:
- ggtitle()
- plot.title = element_text(hjust = 0.5, size = 20)
There are many more things that I haven’t covered in the two ggplot tutorials. However, the basics are most important. Now you have the ability to go and google solutions for your own graphics and are able to understand a solution’s code.
This was the second part of the grammar of graphics ggplot2 tutorial. Now you know about all the ggplot layers. If you have any questions or suggestions, let me know in the comments below.
Comments (4)
Hi Pascal, thanks for a great post. Could you pls specify more abt the “span = 0.7 argument”, and the “wiggliness” of the curves? in what case we might need to use the span argument?
Consider this example,
data.frame(x = sample(1:100, size = 100, replace = TRUE),
y = sample(1:100, size = 100, replace = TRUE)) -> df
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(span = 0.75, se = FALSE)
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(span = 0.25, se = FALSE)
For the first one, the span is set to 0.75. R uses a weighted average to calculate the line of best fit within a particular window. When the span is larger, more points are taken into the calculation of the weighted average to come up with a good fit (bigger window).
On the other hand, if the span is smaller, then fewer points are considered when calculating the weighted average for a particular window in the plot. Therefore, the line appears to be more wiggly.
For example, when you have a scatterplot with only a few data points, then you won’t be able to set the span argument to a very low number because in a particular window, there aren’t enough data points available to calculate a weighted average for the best line fit. You can try this out yourself 🙂
Hope this helps!
Hi, I’m also studying statistics at SFU. I’m in my fourth year and looking at your blog kind of discourages me because your blog is great. I haven’t done any co-op and I feel like I won’t be able to get a job after I graduate. I think I can make basic graphs, do basic cleaning of data, but I’m lost on machine learning techniques. Python is a requirement for a lot of data science jobs these days and I don’t have experience with programming. I just feel kind of down because I don’t know how to start and what to practice/learn. Do you have tips/ guidelines on how I should start? I hope I can get some sort of data analyst position in the future but there’s so much competition these days and I don’t think I’m as smart as the others in my class. I’m in my fourth year, thinking of doing co-op but the requirement is 3 semesters of co-op, which will delay my graduation (If I can even land a co-op job lol). My GPA isn’t good and I’m just worried about my future and I really need to start making a change in my life. Thanks so much for reading this. Your blog is really informative and great, it motivates me but also shows how much I have to learn and how little I know.
Hi man,
I appreciate your kind words regarding the blog. I had a look at your github and it looks good. Try building projects, join hackathons, and analyse data sets you find interesting. Definitely do co-op. It does not matter if you are graduating at the age of 22, 23, or 24. 2 years of “figuring things out” are nothing compared to the other 45 years you will be spending working. Co-op taught me so much more than school because you are surrounded by great mentors all day long.
I do not know so many things and you just learn as you go and you forget a lot of things as you go again. I am currently in Ottawa but will be in Vancouver for my last semester. Let’s definitely meet up in Vancouver. It’s all about perspective. Email me if you want to talk some more – schmidtpascal553@gmail.com