The Grammar of Graphics – Pokemon ggplot Tutorial Part 2

In the last tutorial, we learned the three essential elements of a ggplot2 graphic:

  • Data element
  • Aesthetics element
  • Geometries element

These three elements are the essential building blocks of a successful ggplot2 graphic.

In this tutorial, we will explain the remaining four elements:

  • Facets
  • Statistics
  • Coordinates
  • Themes

These four elements are not essential to creating a ggplot2 graphic, but they are a great help when you want to tell a story with a data set. Again, we will use the Pokemon data set from Kaggle to illustrate the concepts.

Faceting

We already discussed the aesthetics element and the color attribute last time when we compared different groups. We produced a graph such as this one:

ggplot tutorial

However, there is an alternative approach when comparing groups. Faceting takes subgroups of the data and displays them in separate plots.

In the plot above, each Pokemon type has its own color. Many of the points overlap, which makes it hard to see differences between groups. Faceting makes these differences more visible by giving each group its own panel. Let’s try it out.

ggplot tutorial

In the above graph, we used the function facet_grid() to display three separate plots, one per subgroup. The argument cols = vars(Type.1) means that we want a single row with one column per type. Because the panels share a y-axis, this layout is a great way to compare y positions.

On the other hand, if we had specified rows = vars(Type.1), it would have created three rows in a single column. Because the panels then share an x-axis, this is particularly useful for comparing x positions. It is also especially useful for comparing distributions, like below:
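Here is a minimal sketch of both layouts. Since the Kaggle file is not loaded here, it uses a small made-up stand-in data frame; the column names Attack and Defense are assumptions, only Type.1 comes from the text.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(1)
pokemon <- data.frame(
  Type.1  = rep(c("Fire", "Grass", "Water"), each = 50),
  Attack  = round(rnorm(150, mean = 75, sd = 25)),
  Defense = round(rnorm(150, mean = 70, sd = 25))
)

p <- ggplot(pokemon, aes(x = Defense, y = Attack)) +
  geom_point()

# One row with one column per type: the panels share the y-axis.
p_cols <- p + facet_grid(cols = vars(Type.1))

# One column with one row per type: the panels share the x-axis.
p_rows <- p + facet_grid(rows = vars(Type.1))

p_cols
```

Printing p_rows instead shows the stacked layout, which is the one to reach for when comparing x positions.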

ggplot tutorial pokemons

In the case above, we could compare distributions very easily because there are roughly the same number of fire, grass, and water Pokemon in the data set. However, what happens when there is an imbalance in the data set?

Let us examine what happens when there are six times as many water Pokemon as in the original data. We multiply the number of water Pokemon with the code below and then show the same plot as before.

ggplot tutorial

As you can see, this graph makes it hard to read the counts for fire and grass Pokemon. This is because the scale of the y-axis adjusted to the increase in water Pokemon. How do we fix that? The facet_grid() function has a scales argument, which we can set to scales = "free". The graph would look like this:

ggplot tutorial pokemon

Now we can actually see that the mode of fire Pokemon is around 80 attacking points and the mode of grass Pokemon is around 50 attacking points.
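The imbalance experiment can be sketched roughly like this. The data frame is again a made-up stand-in (Attack is an assumed column name), and appending copies of the water rows is just one way to get six times as many of them, not necessarily how the original code did it.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(1)
pokemon <- data.frame(
  Type.1 = rep(c("Fire", "Grass", "Water"), each = 50),
  Attack = round(c(rnorm(50, 80, 20), rnorm(50, 55, 20), rnorm(50, 70, 20)))
)

# Append five extra copies of the water rows, so water appears six times as often.
water <- pokemon[pokemon$Type.1 == "Water", ]
imbalanced <- rbind(pokemon, water[rep(seq_len(nrow(water)), times = 5), ])

# scales = "free" lets every row of panels pick its own y-axis range.
p_free <- ggplot(imbalanced, aes(x = Attack)) +
  geom_histogram(binwidth = 10) +
  facet_grid(rows = vars(Type.1), scales = "free")

p_free
```

Keep in mind that free scales make the shapes easier to read, but the bar heights are no longer directly comparable between panels.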

Another useful tool is the facet_wrap() function, which is handy when faceting by multiple variables. Let’s facet by type and by whether a Pokemon is legendary or not.

ggplot tutorial pokemon

This looks a bit messy because we can’t really compare legendary Pokemon for different types very well. Let’s try another layout of the panels in the grid.

With the argument ncol = 2 in the facet_wrap() function, you can specify how many columns you want. The previous plot used three columns by default. Now, we want only two columns.

ggplot tutorial pokemon

This looks much better, and we can compare legendary Pokemon across types more easily. Notice that we also let the y-axis vary freely with the argument scales = "free_y".

It looks like most legendary Pokemon are of type water. Grass has the fewest legendary Pokemon, and they also have the lowest attack points on average.
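As a rough sketch, the two-column layout could be built like this. The data is a made-up stand-in, and the column names Attack and Legendary are assumptions about the Kaggle file.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(2)
pokemon <- data.frame(
  Type.1    = rep(c("Fire", "Grass", "Water"), each = 50),
  Attack    = round(rnorm(150, mean = 75, sd = 25)),
  Legendary = rep(c(TRUE, FALSE, FALSE, FALSE, FALSE), times = 30)
)

# Facet by two variables; ncol = 2 puts the two Legendary panels of a type side by side.
p_wrap <- ggplot(pokemon, aes(x = Attack)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(vars(Type.1, Legendary), ncol = 2, scales = "free_y")

p_wrap
```

With ncol = 2, each row of panels holds one type, so the non-legendary and legendary histograms of that type end up next to each other.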

Now that we discussed the most important points about faceting, we can move on to the statistics element.

Statistics

Believe it or not, you have already used a lot of stats when creating ggplots, for example when creating a scatter plot with a smooth curve.

ggplot tutorial pokemon

In the above graph, stat_smooth() was used behind the scenes to create the smooth curves. With the span = 0.7 argument, you can control the “wiggliness” of the curves (for the loess smoother): smaller values give wigglier curves, larger values give smoother ones.
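A small sketch of such a plot, again on made-up stand-in data with assumed column names Attack and Defense. The method is set to "loess" explicitly here, because span only affects loess fits.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(3)
defense <- round(runif(150, min = 20, max = 150))
pokemon <- data.frame(
  Type.1  = rep(c("Fire", "Grass", "Water"), each = 50),
  Defense = defense,
  Attack  = round(40 + 0.4 * defense + rnorm(150, sd = 15))
)

# geom_smooth() calls stat_smooth() under the hood; span is passed on to loess.
p_smooth <- ggplot(pokemon, aes(x = Defense, y = Attack, color = Type.1)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.7, se = FALSE)

p_smooth
```

Try span = 0.3 and span = 1 to see how the curves get wigglier or flatter.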

Another example where we have used statistical transformations to create a plot would be a histogram. For example:

ggplot tutorial pokemon

In the histogram above, geom_histogram() uses stat_bin() to divide the data into bins (using a default number of bins) and count the number of observations in each bin. stat_bin() takes the data frame and computes these variables from it:

  • count, the number of observations in each bin
  • density, the density of observations in each bin
  • x, the center of the bin

ggplot tutorial Pokemon

As you can see, geom_histogram() and geom_bar(stat = "bin") produce exactly the same result. The only difference is that geom_histogram() uses stat_bin() by default, while geom_bar() uses stat_count() unless you tell it otherwise.
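To check this yourself, you can pull the computed variables out of both plots with layer_data(). This is a sketch on made-up stand-in data; Attack is an assumed column name and the bin width of 10 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(4)
pokemon <- data.frame(Attack = round(rnorm(150, mean = 75, sd = 25)))

p_hist <- ggplot(pokemon, aes(x = Attack)) +
  geom_histogram(binwidth = 10)

p_bar <- ggplot(pokemon, aes(x = Attack)) +
  geom_bar(stat = "bin", binwidth = 10)

# layer_data() returns the data frame that stat_bin() computed for the layer.
bins_hist <- layer_data(p_hist)
bins_bar  <- layer_data(p_bar)

# x is the bin center, count and density are computed per bin.
head(bins_hist[, c("x", "count", "density")])

# Both layers should end up with the same bin counts.
all.equal(bins_hist$count, bins_bar$count)
```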

Enough with the boring stuff now. It is like knowing how a linear regression algorithm works versus just applying the lm() function in R. Most importantly, you have to know how to use it and interpret the results; later on, you might become interested in how the details work.

Other useful stat functions include stat_function(), which lets you plot functions and distributions. Below, we are plotting a t-distribution with 20 degrees of freedom and coloring in red the critical region where the null hypothesis would be rejected.

Check out my blog post about p-values. The plots I created were made in base R and are explaining visually the concept of p-values.

ggplot tutorial pokemon

With this kind of plot, you can get an intuitive understanding of p-values and probabilities. However, for our Pokemon data set this kind of plot is not very helpful.
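For reference, a t-distribution plot like this can be sketched as follows. The two-sided test and the significance level of 0.05 are assumptions, since the text only gives the 20 degrees of freedom.

```r
library(ggplot2)

df_t  <- 20                         # degrees of freedom, as in the text
alpha <- 0.05                       # assumed significance level
crit  <- qt(1 - alpha / 2, df_t)    # two-sided critical value

# The x range only tells stat_function() where to evaluate the function.
p_t <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  # Shade the upper and lower rejection regions in red.
  stat_function(fun = dt, args = list(df = df_t), xlim = c(crit, 4),
                geom = "area", fill = "red") +
  stat_function(fun = dt, args = list(df = df_t), xlim = c(-4, -crit),
                geom = "area", fill = "red") +
  # Draw the full density curve on top.
  stat_function(fun = dt, args = list(df = df_t)) +
  labs(x = "t", y = "density")

p_t
```

The xlim argument restricts where each layer is evaluated, which is what limits the red areas to the two tails.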

Let’s check out another stat function that is more informative when working with data sets. With the stat_summary() function, we can summarise y values at distinct x values. For example:

ggplot tutorial pokemon

The stat_summary() function calculates the median speed of each Pokemon type. It looks like fire Pokemon are typically faster than grass or water Pokemon. This means that they will, on average, attack first.

In the graph above, instead of geom_point, I used geom_jitter. geom_jitter spreads the points apart so we can have a better understanding of where the density of points is highest.
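A sketch of this plot on made-up stand-in data (Speed is an assumed column name). Note that the fun argument needs ggplot2 3.3.0 or newer; older versions call it fun.y.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(5)
pokemon <- data.frame(
  Type.1 = rep(c("Fire", "Grass", "Water"), each = 50),
  Speed  = round(c(rnorm(50, 75, 20), rnorm(50, 60, 20), rnorm(50, 65, 20)))
)

p_speed <- ggplot(pokemon, aes(x = Type.1, y = Speed)) +
  # Jitter only horizontally so the speed values themselves stay untouched.
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  # One red point per type, placed at the median speed.
  stat_summary(fun = median, geom = "point", color = "red", size = 4)

p_speed
```

Swapping median for mean in stat_summary() gives the average instead.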

Coordinates

Coordinates are all about displaying your data in the right way. They are super useful and contribute a lot to the overall information that a plot conveys.

ggplot tutorial pokemon

This does not look bad. However, we have a lot of Pokemon types on our x-axis, and some of the labels overlap. A better solution would be to flip the x and y axes.
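One way to flip the axes is coord_flip(). The sketch below assumes a boxplot of attack values per type, which may not be the plot shown here, and uses made-up stand-in data with an assumed Attack column.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
types <- c("Bug", "Dark", "Dragon", "Electric", "Fairy", "Fighting",
           "Fire", "Flying", "Ghost", "Grass", "Ground", "Ice",
           "Normal", "Poison", "Psychic", "Rock", "Steel", "Water")
set.seed(6)
pokemon <- data.frame(
  Type.1 = rep(types, each = 10),
  Attack = round(rnorm(180, mean = 75, sd = 25))
)

# coord_flip() swaps the axes, so the type labels are listed along the vertical axis.
p_flipped <- ggplot(pokemon, aes(x = Type.1, y = Attack)) +
  geom_boxplot() +
  coord_flip()

p_flipped
```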

ggplot tutorial pokemon

Now, we can read the Pokemon types way better. When talking about coordinates, we can also talk about scales such as scale_x/y_continuous() etc. These might be useful when you are interested in a specific part of a plot. Let’s consider the histogram below. We want to examine the highest attacking values a little bit more closely.

ggplot tutorial pokemon

ggplot tutorial pokemon

Now, the graph only shows attacking values that are greater than 130 and lower than 180. breaks = seq(130, 180, 5) means that we want to go from 130 to 180 in steps of 5. So there should be a tick at 130, 135, 140, …, 180.
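A sketch of such a zoomed histogram, on made-up stand-in data with an assumed Attack column and an arbitrary bin width of 5:

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(7)
pokemon <- data.frame(Attack = round(rnorm(800, mean = 80, sd = 32)))

# Ticks at 130, 135, ..., 180.
attack_breaks <- seq(130, 180, 5)

# limits restricts the x-axis range, breaks sets where the ticks go.
p_zoom <- ggplot(pokemon, aes(x = Attack)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(limits = c(130, 180), breaks = attack_breaks)

p_zoom
```

One thing to be aware of: limits on a scale throw away the data outside the range (ggplot2 warns about the removed rows) before the bins are computed. If you only want to zoom in without dropping data, coord_cartesian(xlim = c(130, 180)) is the alternative.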

Other useful transformations are log transformations of an axis. They are pretty useful when the data is very cluttered. Look, for example, at the graph below from the gapminder data set. Most values are crowded close to zero, so a log transformation would help to make more sense of the graphic.

ggplot tutorial pokemon

In order to have a better look at the data, we can log transform the x-axis with scale_x_log10() and we get a better picture of the relationship between life expectancy and GDP per capita.
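Here is a sketch of the log-scaled version. To keep it runnable without extra packages, it uses made-up data instead of the real gapminder data; the column names gdpPercap and lifeExp mirror the ones in the gapminder package.

```r
library(ggplot2)

# Made-up data shaped like gapminder: life expectancy grows with log GDP.
set.seed(8)
gdp <- exp(runif(200, min = log(300), max = log(50000)))
countries <- data.frame(
  gdpPercap = gdp,
  lifeExp   = 20 + 6 * log(gdp) + rnorm(200, sd = 3)
)

# scale_x_log10() transforms the x values before they are plotted.
p_log <- ggplot(countries, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

p_log
```

Without scale_x_log10(), most points pile up on the left side of the plot; on the log scale the relationship looks roughly linear.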

ggplot tutorial pokemon

There is not much more to say about coordinates. Many more things are possible but we discussed the most important and most used ones.

Next, we are talking about themes. In this context, we are also talking about how to improve legends and axis labels. Themes are all about how a graph looks.

Themes

ggplot tutorial pokemon

We notice that the background changed with the new theme choice. We also added a title to the graphic with ggtitle(). However, we notice that the title of the ggplot is not centered.

ggplot tutorial pokemon

Now, the title is centered and we changed the x-labels and y-labels to be more informative.

Now let’s make the legend and the overall plot a bit nicer.

ggplot tutorial pokemon

This plot is much better.

Notice that we made the axis titles and the tick labels bigger with:

axis.title.x = element_text(size = 15). Enlarging the x-axis title.
axis.text.x = element_text(size = 15). Enlarging the x-axis tick labels.
axis.title.y = element_text(size = 15). Enlarging the y-axis title.
axis.text.y = element_text(size = 15). Enlarging the y-axis tick labels.

The tick marks themselves are controlled by axis.ticks.x and axis.ticks.y, which take element_line() rather than element_text().

This is really helpful for presentations so that the audience can clearly see what is going on. Without being able to read the axis labels properly, a plot loses its meaning.

We moved the legend to the bottom with:

legend.position = "bottom"

We changed the legend title and labels with:

scale_color_discrete()

We added a title and centered it with:

ggtitle()
plot.title = element_text(hjust = 0.5, size = 20)
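Putting the pieces together, the whole plot could be sketched like this. The data is a made-up stand-in with assumed column names, theme_bw() is just one possible theme choice, and all the titles and labels are placeholders.

```r
library(ggplot2)

# Hypothetical stand-in for the Kaggle Pokemon data (values are made up).
set.seed(9)
pokemon <- data.frame(
  Type.1  = rep(c("Fire", "Grass", "Water"), each = 50),
  Attack  = round(rnorm(150, mean = 75, sd = 25)),
  Defense = round(rnorm(150, mean = 70, sd = 25))
)

p_theme <- ggplot(pokemon, aes(x = Defense, y = Attack, color = Type.1)) +
  geom_point() +
  ggtitle("Attack vs. Defense by Pokemon Type") +
  labs(x = "Defense points", y = "Attack points") +
  # New legend title and labels (in the order Fire, Grass, Water).
  scale_color_discrete(name = "Primary type",
                       labels = c("Fire type", "Grass type", "Water type")) +
  # Complete themes go first, otherwise they overwrite the tweaks below.
  theme_bw() +
  theme(
    plot.title      = element_text(hjust = 0.5, size = 20),  # centered title
    axis.title.x    = element_text(size = 15),
    axis.title.y    = element_text(size = 15),
    axis.text.x     = element_text(size = 15),               # tick labels
    axis.text.y     = element_text(size = 15),
    legend.position = "bottom"
  )

p_theme
```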

There are many more things that I haven’t covered in the two ggplot tutorials. However, the basics are most important. Now you have the ability to go and google solutions for your own graphics and are able to understand a solution’s code.

This was the second part of the grammar of graphics ggplot2 tutorial. Now you know about all the ggplot layers. If you have any questions or suggestions, let me know in the comments below.