February 13, 2019 By Pascal Schmidt R Tidyverse Tutorial

ggplot2 is an R package for producing data visualizations. It is based on The Grammar of Graphics by Leland Wilkinson and is one of the most widely used graphics packages in R, so it is well worth the effort of learning. So let’s get you started with it!

ggplot2 consists of the following elements:

Essential Elements

- Data

The data element is the data set itself

- Aesthetics

Variables in the data are mapped onto aesthetics: the x or y position and aesthetic attributes such as color, shape, or size

- Geometries

This element determines how our data is displayed (bars, points, lines)

Every single plot that you will ever make consists of these three essential elements.

Optional

- Facets

Faceting splits the data into subsets and displays the same graph for every subset.

- Statistics

Lets you transform the data (add mean, median, quartile)

- Coordinates

Transforms axes (changes spacing of displayed data)

- Themes

Lets you change the graphic’s background, axis size, or header.

Let us use a Pokemon data set from Kaggle to pick up some ggplot2 knowledge.

str(poke)


'data.frame':	800 obs. of  13 variables:
 $ X.        : int  1 2 3 3 4 5 6 6 6 7 ...
 $ Name      : Factor w/ 800 levels "Abomasnow","AbomasnowMega Abomasnow",..: 81 330 746 747 103 104 100 101 102 666 ...
 $ Type.1    : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 10 7 7 7 7 7 18 ...
 $ Type.2    : Factor w/ 19 levels "","Bug","Dark",..: 15 15 15 15 1 1 9 4 9 1 ...
 $ Total     : int  318 405 525 625 309 405 534 634 634 314 ...
 $ HP        : int  45 60 80 80 39 58 78 78 78 44 ...
 $ Attack    : int  49 62 82 100 52 64 84 130 104 48 ...
 $ Defense   : int  49 63 83 123 43 58 78 111 78 65 ...
 $ Sp..Atk   : int  65 80 100 122 60 80 109 130 159 50 ...
 $ Sp..Def   : int  65 80 100 120 50 65 85 85 115 64 ...
 $ Speed     : int  45 60 80 80 65 80 100 100 100 43 ...
 $ Generation: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Legendary : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...

Data Element, Aesthetics Element, and Geometric Element

atk_vs_def <- ggplot(data = poke, mapping = aes(x = Attack, y = Defense))

ggplot2 plots are objects, which means that you can assign them to a name. If we displayed the plot now, we would only see an empty panel with the two axes, because we have not specified a geometric element yet. The first argument in ggplot() is the data frame. In the second argument, we map variables from the data onto the x-axis and y-axis with the aesthetics element.
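To check this for yourself, here is a minimal sketch. It assumes a small toy data frame called poke_demo (a hypothetical name), filled with the first few Attack and Defense values from the str() output above, so that it runs without the Kaggle file.

```r
library(ggplot2)

# Hypothetical toy data standing in for `poke`
# (first five Attack and Defense values from the str() output).
poke_demo <- data.frame(
  Attack  = c(49, 62, 82, 100, 52),
  Defense = c(49, 63, 83, 123, 43)
)

# Data + aesthetics only: a valid plot object, but no layer to draw yet.
p <- ggplot(data = poke_demo, mapping = aes(x = Attack, y = Defense))
n_layers_before <- length(p$layers)

# Adding a geom returns a new object; `p` itself stays unchanged.
p_points <- p + geom_point()
n_layers_after <- length(p_points$layers)

c(before = n_layers_before, after = n_layers_after)
```

The layer count only goes up once a geom is added, which is why the first object has nothing to draw inside the panel.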

Attack and Defense are both continuous variables. Hence, a scatter plot is appropriate.

You can create a scatter plot by adding the geom_point() function to our existing object, atk_vs_def:

atk_vs_def +
    geom_point()

It looks like there is a positive linear relationship between the Defense and Attack variables.

Structure of a ggplot:

ggplot(data = <data set>, mapping = aes(x = <variable in data>, y = <variable in data>)) +
    geom_whateverPlotYouWant()

We added the geom element with the + sign. Two lines of code are enough for a useful graphic. Amazing! However, the above graph looks a little bit boring and could be made more interesting. Let’s see how we would go about it.

Aesthetics Attributes

We could, for example, change the color, shape, or size of the points.

atk_vs_def +
   geom_point(shape = 1, size = 5, color = "red")

This definitely looks more interesting! Usually, though, we do not want to change the color, shape, or size for all of our data at once. Sure, the plot with the red points catches our attention more than the plot with the black dots, but it does not convey any more information. Data visualization is not only about beautiful plots. It is also (most importantly) about conveying as much information as possible with a clean and easy-to-understand graphic.

So let’s have a look at a more powerful graphic that uses the color, shape, and size attributes more effectively.

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, 
                                  color = Legendary, 
                                  shape = as.factor(Generation), 
                                  size = HP)) +
   geom_point()

This graph conveys a lot of information but looks quite busy. However, you can see how putting the color, shape, and size attributes into the aesthetic element changes the plot. shape = as.factor(Generation) changes the shape depending on the generation of each Pokemon. According to the legend, all Pokemon from generation 1 are displayed with a circle, and so on. The size attribute works in the same way: the more hit points a Pokemon has, the larger its symbol. Lastly, legendary Pokemon are colored blue and all others red.

A great thing about specifying these attributes in the aesthetic element is that we do not need to create a legend ourselves. ggplot2 takes care of that.

A less busy graph would look like this:

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, color = Legendary)) +
    geom_point(alpha = 0.5)

It seems like legendary Pokemon have, on average, higher defense and attack values than “ordinary” ones.

In the plot above, we set the transparency of the dots in the geometric layer with alpha = 0.5 (alpha = 0 makes the points invisible and alpha = 1 is the default option).

`ggplot` vs. `geom` Elements

The next couple of graphs teach an important lesson about ggplot2.

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, color = Legendary)) +
    geom_point() +
    geom_smooth()

Above, the points and lines are colored according to the aesthetic element in the ggplot() function.

ggplot(data = poke, mapping = aes(x = Attack, y = Defense)) +
    geom_point(aes(color = Legendary)) +
    geom_smooth()

However, we can also put the aesthetic element with the color attribute in the geom_point() function, as in the graph above. Notice what happens? Only the points are colored, and geom_smooth() draws a single curve for all Pokemon.

Similarly, in the graph below, we put the color attribute together with the aesthetic element in the geom_smooth() function.

ggplot(data = poke, mapping = aes(x = Attack, y = Defense)) +
    geom_point() +
    geom_smooth(aes(color = Legendary))

Now, we get two smooth curves, one for legendary and one for ordinary Pokemon, while the points stay black. Notice the pattern? When we specified the color attribute in the global ggplot() function, it applied to every single geom. When we specified it in only one geom function, only that geom got colored.

This concept is important to understand when working with more than one layer.
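One way to see where a mapping lives is to inspect the plot objects directly. The sketch below assumes a toy data frame, poke_demo, whose Legendary labels are made up for illustration. Nothing is drawn; we only look at which mappings are stored globally and which are stored on a single layer.

```r
library(ggplot2)

# Hypothetical toy data; the Legendary labels are invented for this example.
poke_demo <- data.frame(
  Attack    = c(49, 62, 82, 100),
  Defense   = c(49, 63, 83, 123),
  Legendary = c("False", "False", "True", "True")
)

# Color mapped globally: every layer inherits it.
p_global <- ggplot(poke_demo, aes(x = Attack, y = Defense, color = Legendary)) +
  geom_point() +
  geom_smooth()

# Color mapped locally: only the point layer knows about it.
p_local <- ggplot(poke_demo, aes(x = Attack, y = Defense)) +
  geom_point(aes(color = Legendary)) +
  geom_smooth()

global_aes <- names(p_global$mapping)              # x, y and colour
local_plot <- names(p_local$mapping)               # x and y only
point_aes  <- names(p_local$layers[[1]]$mapping)   # colour sits on the points
smooth_aes <- names(p_local$layers[[2]]$mapping)   # smooth layer has no colour
```

Note that ggplot2 stores the aesthetic under the name "colour", whichever spelling you type.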

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, color = Legendary)) +
    geom_point() +
    geom_smooth(aes(group = 1))

Global attributes can be overridden by local ones, as in the plot above. Now, we see only one smoothing curve for all Pokemon together.

A more obvious example of overriding would be this one:

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, color = Legendary)) +
    geom_point(aes(y = Speed))

Now we have overridden Defense with Speed. However, the y-axis still says “Defense” because the axis label is not automatically adjusted to the overriding mapping.
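If you do override a position aesthetic in a geom, you can set the axis title yourself with labs(). This is a small sketch on a toy data frame, poke_demo (a hypothetical name, using the first few values from the str() output), not a change to the plot above.

```r
library(ggplot2)

# Hypothetical toy data built from the first rows of the str() output.
poke_demo <- data.frame(
  Attack  = c(49, 62, 82, 100, 52),
  Defense = c(49, 63, 83, 123, 43),
  Speed   = c(45, 60, 80, 80, 65)
)

# The local mapping replaces Defense with Speed on the y-axis,
# and labs() makes the axis title say so.
p <- ggplot(poke_demo, aes(x = Attack, y = Defense)) +
  geom_point(aes(y = Speed)) +
  labs(y = "Speed")

y_title <- p$labels$y
```

In practice it is usually cleaner to map y = Speed in ggplot() right away, so the label is correct without extra work.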

We can also map continuous variables to the color attribute. However, this is less common than mapping categorical variables to it.

Aesthetic Attributes for Continuous Variables

ggplot(data = poke, mapping = aes(x = Attack, y = Defense, color = Defense)) +
    geom_point()

Size can also be mapped to continuous variables (ggplot2 only warns when size is mapped to a discrete one). The shape attribute cannot be used for continuous variables. So the takeaway is that different types of aesthetic attributes work better with different types of variables.
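The shape restriction is easy to try out. The sketch below uses a toy data frame, poke_demo (a hypothetical name, with values taken from the str() output), and catches the error instead of stopping. Binning HP with cut() is just one possible workaround and is my assumption, not something used elsewhere in this tutorial.

```r
library(ggplot2)

# Hypothetical toy data built from the first rows of the str() output.
poke_demo <- data.frame(
  Attack  = c(49, 62, 82, 100, 52, 64),
  Defense = c(49, 63, 83, 123, 43, 58),
  HP      = c(45, 60, 80, 80, 39, 58)
)

# Mapping a continuous variable to shape: the plot object is created,
# but building (or printing) it fails.
p_bad <- ggplot(poke_demo, aes(x = Attack, y = Defense, shape = HP)) +
  geom_point()
msg <- tryCatch(
  { ggplot_build(p_bad); "no error" },
  error = function(e) conditionMessage(e)
)

# Workaround: turn HP into a few categories first.
p_ok <- ggplot(poke_demo, aes(x = Attack, y = Defense,
                              shape = cut(HP, breaks = 3))) +
  geom_point()
built_ok <- ggplot_build(p_ok)
```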

So far, we have only worked with geom_smooth() and geom_point(). These geometric elements are good when working with continuous data. Next, let’s work with some categorical variables from the Pokemon data set.

I am only choosing to work with a subset of the Pokemon data set. Specifically, I am only displaying graphics for fire, grass, and water Pokemon.

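library(magrittr)  # provides the %>% pipe

# convert factor columns to character so that string matching works on the values
poke <- poke %>%
    dplyr::mutate_if(is.factor, as.character)

# keep only Pokemon whose primary type is Grass, Water, or Fire
poke_mod <- poke %>%
    dplyr::filter(stringr::str_detect(Type.1, "^(Grass|Water|Fire)$"))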
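Since we are matching whole type names rather than patterns, %in% is a simple alternative to a regular expression. Here is a minimal sketch on a toy data frame. The names poke_demo and poke_mod_demo are hypothetical, and the four rows are only there to illustrate the filter.

```r
library(dplyr)

# Hypothetical toy data: four Pokemon with their primary type.
poke_demo <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Squirtle", "Caterpie"),
  Type.1 = c("Grass", "Fire", "Water", "Bug"),
  stringsAsFactors = FALSE
)

# Keep rows whose primary type is exactly one of the three wanted types.
poke_mod_demo <- poke_demo %>%
  filter(Type.1 %in% c("Grass", "Water", "Fire"))

poke_mod_demo$Name
```

If you prefer stringr::str_detect(), remember that the string to search comes first and the pattern second.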

Setting vs. Mapping

ggplot(data = poke_mod, mapping = aes(x = Type.1, col = "green")) +
    geom_bar()

In the plot above, we are mapping the color to “green”. This variable is not in our data set, so ggplot2 creates a new variable containing only the value “green” and then scales it with a color scale, which is why the outlines are not actually green.

You don’t want to do this. You want to set the color to a constant instead. You may also have noticed that the color attribute only colors the frame of the bars. In order to color the entire bar, we have to use the fill attribute.

ggplot(data = poke_mod, mapping = aes(x = Type.1)) +
    geom_bar(fill = "green")
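To see the difference between mapping and setting in numbers, we can look at the values ggplot2 actually uses for drawing. This sketch assumes a tiny toy data frame (poke_demo, a hypothetical name) and uses layer_data() to pull out the computed colors.

```r
library(ggplot2)

# Hypothetical toy data: just a type column.
poke_demo <- data.frame(Type.1 = c("Grass", "Grass", "Fire", "Water"))

# Mapping: "green" is treated as data and sent through a color scale.
p_mapped <- ggplot(poke_demo, aes(x = Type.1, color = "green")) +
  geom_bar()

# Setting: "green" is used as the literal fill color.
p_set <- ggplot(poke_demo, aes(x = Type.1)) +
  geom_bar(fill = "green")

mapped_colour <- unique(layer_data(p_mapped)$colour)  # a palette color, not "green"
set_fill      <- unique(layer_data(p_set)$fill)       # the constant we asked for
```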

A very annoying color. Let’s make the plot prettier. Let’s apply what we have learned in previous plots and make the graph more informative.

Categorical Variables and Positioning

ggplot(data = poke_mod, mapping = aes(x = Type.1, fill = as.factor(Generation))) +
    geom_bar()

This is a stacked bar plot that shows how many Pokemon of each type are from which generation. It looks like most fire Pokemon come from generation 1. However, it is hard to read.

ggplot(data = poke_mod, mapping = aes(x = Type.1, fill = as.factor(Generation))) +
    geom_bar(position = "fill")

When we specify position = “fill”, we get a stacked bar plot that shows the proportion of each generation within a type rather than the count (the default).

ggplot(data = poke_mod, mapping = aes(x = Type.1, fill = as.factor(Generation))) +
    geom_bar(position = "dodge")

When we specify position = “dodge”, then we can see the counts for each Generation next to each other for each type.
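The numbers behind these bar plots can be reproduced with base R. The sketch below uses a made-up toy data frame (poke_demo, a hypothetical name with invented rows): table() gives the counts that the default and “dodge” positions display, and prop.table() gives the within-type shares that position = “fill” displays.

```r
# Hypothetical toy data: type and generation for six made-up rows.
poke_demo <- data.frame(
  Type.1     = c("Fire", "Fire", "Grass", "Grass", "Grass", "Water"),
  Generation = c(1, 2, 1, 1, 3, 1)
)

# Counts per type and generation: the bar heights for "stack" and "dodge".
counts <- table(poke_demo$Type.1, poke_demo$Generation)

# Shares within each type: the segment heights for position = "fill".
shares <- prop.table(counts, margin = 1)

counts
round(shares, 2)
```

Every row of shares sums to 1, which is why all bars have the same height with position = “fill”.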

Another important geometric object is a box plot. We use a categorical variable for the x-axis and a continuous variable for the y-axis. Let’s figure out which type of Pokemon has the most attacking points.

ggplot(data = poke_mod, mapping = aes(x = Type.1, y = Attack)) +
    geom_boxplot()

Fire Pokemon have the highest median attacking points.

Again, we can further modify our plot to convey more information by creating a graphic for Pokemon by type and generation or legendary vs. ordinary.

ggplot(data = poke_mod, mapping = aes(x = Type.1, y = Attack, fill = as.factor(Legendary))) +
    geom_boxplot()

It looks like legendary Pokemon have on average higher attacking power. Water Pokemon have the greatest variance and the highest median.

Other useful geoms are histograms and density plots.

ggplot(data = poke_mod, mapping = aes(x = Speed, fill = as.factor(Legendary))) +
    geom_histogram()

ggplot(data = poke_mod, mapping = aes(x = Speed, fill = as.factor(Legendary))) +
    geom_density(alpha = 0.5)

Final Project

Let’s apply our newly gained ggplot2 knowledge with the ultimate question I asked myself when I was 8 years old.

Which Pokemon should I pick at the beginning? Bulbasaur, Charmander, or Squirtle?

I used to play the blue edition on my Game Boy Color, so Pikachu is not an option. From my childhood experience, the best choice was Bulbasaur because I could get Articuno as an ice/water Pokemon and Moltres as a fire Pokemon. So the type I was missing was a grass Pokemon such as Bulbasaur.

Let’s make an educated data science decision by considering attacking points and defending points.

ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
    geom_point()

Now let us introduce a new geom, geom_text(). This geometric element labels the Pokemon (dots) with their names, which lets us assess the strength of the desired Pokemon. We do that by creating an aesthetic element in the geom_text() function with label = Name. Name is a variable in the data frame with all the Pokemon names.

ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
    geom_point(alpha = 0.5) +
    geom_text(aes(label = Name))

This looks very messy and we cannot really identify our desired Pokemon. Let’s get the desired Pokemon from the data frame and save them in an object called poke_childhood_decision. Don’t worry if you are not familiar with the code that selects them. All we are doing is selecting 9 Pokemon: the starter Pokemon and their evolutions.

Next, we want to plot our findings.

# identify the nine desired Pokemon (the starters and their evolutions)
# by exact name and save them in a new data frame
poke_childhood_decision <- poke_mod %>%
    dplyr::filter(Name %in% c("Bulbasaur", "Ivysaur", "Venusaur",
                              "Charmander", "Charmeleon", "Charizard",
                              "Squirtle", "Wartortle", "Blastoise"))

ggplot(data = poke_mod, mapping = aes(x = Attack, y = Defense, color = Type.1)) +
    geom_point(alpha = 0.5) +
    geom_text(data = poke_childhood_decision, aes(label = Name))

Better, but still not good enough. An important takeaway from the plot above is that even the data frame in the ggplot() function can be overridden in a geom element. We used the poke_childhood_decision data frame instead of the poke_mod data frame to label our Pokemon.
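Here is a stripped-down sketch of the same idea, with toy data frames named all_pokemon and starters (hypothetical names, three illustrative rows). layer_data() shows how many rows each layer ends up drawing.

```r
library(ggplot2)

# Hypothetical toy data: three Pokemon, two of which we want to label.
all_pokemon <- data.frame(
  Name    = c("Bulbasaur", "Caterpie", "Squirtle"),
  Attack  = c(49, 30, 48),
  Defense = c(49, 35, 65)
)
starters <- subset(all_pokemon, Name != "Caterpie")

# Points use the global data; the text layer brings its own data frame.
p <- ggplot(all_pokemon, aes(x = Attack, y = Defense)) +
  geom_point(alpha = 0.5) +
  geom_text(data = starters, aes(label = Name))

n_points <- nrow(layer_data(p, 1))  # rows drawn by geom_point()
n_labels <- nrow(layer_data(p, 2))  # rows drawn by geom_text()
```

The text layer still inherits x and y from ggplot(), so the override data frame must contain those columns too.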

Let’s create a graph, where we can identify the points for our desired Pokemon better.

ggplot(data = poke_childhood_decision, mapping = aes(x = Attack, y = Defense, 
                                                     color = Type.1, 
                                                     label = Name)) +
    geom_point() +
    geom_text()

Better, but why is there this weird “a” in the legend? Remember what we said about global mappings in the ggplot() function? The legend created by color = Type.1 is drawn for both geometric elements, geom_point() and geom_text(). Therefore, both a point and the letter “a” are visible in the legend.

We can fix this locally within the geom_text() function, in the following way:

ggplot(data = poke_childhood_decision, aes(x = Attack, y = Defense, 
                                           color = Type.1, label = Name)) +
    geom_point() +
    geom_text(show.legend = FALSE)

The argument show.legend = FALSE removes the geom_text() layer from the legend.

Let’s put the Pokemon names above their corresponding points and set the limits of the x-axis. We do the latter with scale_x_continuous(), which is strictly speaking a scale but does a job similar to the coordinates element from the **optional** list in our introduction.

ggplot(data = poke_childhood_decision, aes(x = Attack, y = Speed, 
                                           color = Type.1, label = Name)) +
    geom_point() +
    geom_text(show.legend = FALSE, vjust = - 0.5) +
    scale_x_continuous(limits = c(35, 95))

From the plot, it is quite clear which Pokemon is best. It is Squirtle. Initially, it has the lowest attacking points but catches up in later stages when it becomes Blastoise.
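A short note on the axis limits used in the last plot. scale_x_continuous(limits = ...) turns values outside the limits into missing values before drawing, whereas coord_cartesian(xlim = ...) only zooms the view. The sketch below compares the two on a toy data frame, starters (a hypothetical name, with Attack and Speed values taken from the str() output). Because all points lie inside the limits here, both versions draw the same picture.

```r
library(ggplot2)

# Hypothetical toy data: the three starters, values from the str() output.
starters <- data.frame(
  Name   = c("Bulbasaur", "Charmander", "Squirtle"),
  Attack = c(49, 52, 48),
  Speed  = c(45, 65, 43)
)

base_plot <- ggplot(starters, aes(x = Attack, y = Speed, label = Name)) +
  geom_point() +
  geom_text(vjust = -0.5)

# Limits set on the scale: data outside 35..95 would become NA.
p_scale <- base_plot + scale_x_continuous(limits = c(35, 95))

# Limits set on the coordinate system: a pure zoom, no data is touched.
p_coord <- base_plot + coord_cartesian(xlim = c(35, 95))

n_scale <- nrow(layer_data(p_scale, 1))
n_coord <- nrow(layer_data(p_coord, 1))
```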

This was part 1 of the grammar of graphics ggplot2 tutorial. This one has covered all the essential elements of a ggplot2 plot. In the next part, I am going over the remaining four layers which will result in prettier and more readable graphics. If you have any questions or suggestions, let me know in the comments below.

Tags: data visualization ggplot2 R tidyverse

The Grammar Of Graphics – All You Need to Know About ggplot2 and Pokemons
