A Short Tutorial about Magrittr’s Pipe Operator and Placeholders

magrittr’s pipe operator, %>%, is one of the most useful tools for data wrangling and helps you keep your code:

clean and readable
maintainable

The magrittr pipe chains steps together in much the same way as the + sign chains layers in ggplot2. If you need a quick reminder on how the plus sign is used in ggplot2 then here are two tutorials you can go through.

A Short Overview of the Magrittr’s Pipe Operator (%>%)

To demonstrate the above advantages of the pipe operator, consider the following example.

round(cos(exp(sin(log10(sqrt(25))))), 2)

# -0.33

The code above looks messy and it is cumbersome to step through all the different functions and also keep track of the brackets when writing the code.

The method below uses magrittr’s pipe (%>%) and makes the function calls easier to understand.

library(magrittr)

sqrt(25) %>%
  log10() %>%
  sin() %>%
  exp() %>%
  cos() %>%
  round(2)

# -0.33

The pipe operator takes the result of the square root and passes it to the log10() function. That result is in turn passed to the sin() function, then to the exp() function, then to the cos() function, and finally we round the result to two decimal places.

So the pipe takes the result of the left-hand side (in our case, the line above) and passes it to the right-hand side (in our case, the line below).
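As a minimal sketch of this rule (the numbers are made up for illustration), a piped call and the equivalent nested call return the same value:

```r
library(magrittr)

# Nested call: read from the inside out.
nested <- round(sqrt(16), 1)

# Piped call: read from left to right.
piped <- 16 %>% sqrt() %>% round(1)

identical(nested, piped)
# [1] TRUE
```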

Now the words clean, readable, and maintainable make much more sense. The code we wrote with the pipe is easier to read, and mistakes are easier to spot. For example, if code written with the pipe throws an error, we can run the pipeline one step at a time to find the exact location where it breaks and debug from there.
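One way to locate a problem, sketched here with the pipeline from above, is to cut the chain short and inspect the intermediate value. The variable names partial and full are only for illustration.

```r
library(magrittr)

# The full pipeline from the example above.
full <- sqrt(25) %>%
  log10() %>%
  sin() %>%
  exp() %>%
  cos() %>%
  round(2)

# The same pipeline stopped after the second step, so that the
# intermediate value can be checked before the next function runs.
partial <- sqrt(25) %>%
  log10()

partial
full
```

If a later step fails, extending the shortened pipeline one line at a time shows which function call is responsible.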

Basic Pipe Structure and Placeholders

Imagine that you have a function that takes two arguments: function(argument1, argument2). With the pipe you would write argument1 %>% function(argument2), and the pipe operator automatically places argument1 as the first argument of the function.

Now imagine that you want to pipe in argument2 instead of argument1. This would look like this: argument2 %>% function(argument1, .). Here we have to place a dot where the function expects its second argument. The dot is called a placeholder and tells the pipe operator to put the left-hand side there instead of in the first position.

So remember, by default the pipe inserts the value produced on the left-hand side of the pipe operator as the first argument of the function on the right-hand side.

So the following code is equivalent:

– argument1 %>% function(., argument2)
– argument1 %>% function(argument2)
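A small sketch of these placeholder rules, using base R's rep() as a stand-in for the generic function above (the values are chosen only for illustration):

```r
library(magrittr)

# Default: the left-hand side becomes the first argument, rep("ab", 3).
first_implicit <- "ab" %>% rep(3)

# The same call with the placeholder written out.
first_explicit <- "ab" %>% rep(., 3)

# Placeholder in the second position: this is also rep("ab", 3).
second_position <- 3 %>% rep("ab", .)

first_implicit
# [1] "ab" "ab" "ab"

identical(first_implicit, first_explicit)
# [1] TRUE
identical(first_implicit, second_position)
# [1] TRUE
```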

Let’s use an example from the gapminder data set.

Curly Brackets {} and the Magrittr’s Pipe Operator

Consider the code below. We filter for the continent Asia, pull out the gdpPercap column as a vector, round it to two decimal places, and then display the first ten elements of the vector.

library(gapminder)  # provides the gapminder data set

gapminder %>%
  dplyr::filter(continent == "Asia") %>%
  dplyr::pull(gdpPercap) %>%
  round(2) %>%
  head(10)

# [1] 779.45 820.85 853.10 836.20 739.98 786.11 978.01 852.40 649.34 635.34

Now consider the code below. How could we get the same output as above when writing head() and round() on the same line as a nested function call?

gapminder %>%
  dplyr::filter(continent == "Asia") %>%
  dplyr::pull(gdpPercap) %>% 
  head(round(2), 10)

# [1] 779.4453 820.8530

gapminder %>%
  dplyr::filter(continent == "Asia") %>%
  dplyr::pull(gdpPercap) %>% 
  head(., round(2), 10)

# [1] 779.4453 820.8530

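Both versions give the same wrong output. By default the pipe inserts the left-hand side as the first argument, so both calls are evaluated as head(vector, round(2), 10). Here round(2) is simply the number 2, which head() uses as the number of elements to return, and the trailing 10 is silently ignored. The vector itself never reaches round(), which needs it as its first argument; the second argument of round() is the number of decimal places.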
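The same behaviour can be reproduced on a small made-up vector. This is only a sketch; x and its values are not part of the gapminder example.

```r
library(magrittr)

x <- c(1.234, 2.345, 3.456, 4.567)

# The pipe inserts x as the first argument, so this is head(x, round(2), 10).
piped <- x %>% head(round(2), 10)

# round(2) evaluates to 2, so the call above is the same as this one.
direct <- head(x, 2, 10)

piped
# [1] 1.234 2.345

identical(piped, direct)
# [1] TRUE
```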

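Intuitively, a solution like the one below comes to mind. However, this code throws an error. A dot that appears only inside a nested function call does not stop the pipe from also inserting the left-hand side as the first argument, so the last line reads head(dplyr::pull(gdpPercap), round(dplyr::pull(gdpPercap), 2), 10).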

gapminder %>%
  dplyr::filter(continent == "Asia") %>%
  dplyr::pull(gdpPercap) %>% 
  head(round(., 2), 10)

Remember, the head() function expects a vector and a single number that specifies how many elements to show: head(vector, integer). What we specified in the code above, however, is head(vector, vector, integer). To make the code work, we need head(round(dplyr::pull(gdpPercap), 2), 10). To achieve that, we have to change magrittr’s piping behaviour and stop it from placing the left-hand side into the right-hand side as the first argument. Wrapping the right-hand side in curly brackets does exactly that, as the code below shows.

gapminder %>%
  dplyr::filter(continent == "Asia") %>%
  dplyr::pull(gdpPercap) %>% 
  {head(round(., 2), 10)}

# [1] 779.45 820.85 853.10 836.20 739.98 786.11 978.01 852.40 649.34 635.34

all.equal(
  gapminder %>%
    dplyr::filter(continent == "Asia") %>%
    dplyr::pull(gdpPercap) %>% 
    round(2) %>% 
    head(10),

  gapminder %>%
    dplyr::filter(continent == "Asia") %>%
    dplyr::pull(gdpPercap) %>% 
    {head(round(., 2), 10)}
)

# [1] TRUE

With the help of the curly brackets, we prevented dplyr::pull(gdpPercap) from being placed into the head() function as its first argument.
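The effect of the curly brackets can also be checked on a small made-up vector. The sketch below wraps the failing call in tryCatch() so that the script keeps running; the exact error message depends on the R version, so it is replaced by the string "error" here.

```r
library(magrittr)

x <- c(3.14159, 2.71828, 1.41421, 1.73205)

# Without curly brackets the nested dot does not stop the insertion of x
# as the first argument: the call becomes head(x, round(x, 2), 2), which
# fails because the second argument of head() must be a single number.
without_braces <- tryCatch(
  x %>% head(round(., 2), 2),
  error = function(e) "error"
)

# With curly brackets the expression is evaluated as written,
# with the dot standing for x.
with_braces <- x %>% {head(round(., 2), 2)}

without_braces
# [1] "error"

with_braces
# [1] 3.14 2.72
```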

In contrast, here is an example where we do not want to suppress the default behaviour of placing the left-hand side as the first argument of the right-hand side.

gapminder %>%
  dplyr::filter(continent == "Asia", year > 1990) %>% 
  dplyr::group_by(country) %>%
  dplyr::do(head(., 2))

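The dplyr function do() takes a data frame as its first argument and an expression as its second argument. The expression is evaluated once per group (here, once per country in Asia), with the dot referring to the rows of the current group. Because we want the piped data frame to be the first argument, we do not need curly brackets for the do() function. In recent dplyr versions do() is marked as superseded, but it still works as shown.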

What if we wanted to know the correlation between life expectancy and GDP per capita? One way to do that would be with the code below.

gapminder %>%
  dplyr::filter(continent == "Asia") %>% 
  {stats::cor(.$lifeExp, .$gdpPercap)}

# Equivalent to the code above
gapminder %>%
  dplyr::filter(continent == "Asia") -> only_asia
stats::cor(only_asia$lifeExp, only_asia$gdpPercap)

gapminder %>% dplyr::filter(continent == "Asia") produces a data frame. If you have used a bit of R before, then you are probably familiar with the notation data_frame$column, which extracts a column from a data frame.

In the code above we did cor(data_frame$lifeExp, data_frame$gdpPercap), where the dot placeholder stood for the data frame. Again we put the curly brackets around the function call in order to suppress the default behaviour of the pipe operator, which would place the data frame as the first argument of the cor() function. If we had left out the curly brackets, the function would read cor(data_frame, data_frame$lifeExp, data_frame$gdpPercap) and the code would have thrown an error. This is wrong because cor() expects the two vectors as its first two arguments, x and y; its third argument, use, controls how missing values are handled and is not a data vector.
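The same pattern can be tried on the built-in mtcars data set, which is used here only so that the sketch runs without gapminder. It also shows magrittr's exposition pipe, %$%, which is not used elsewhere in this tutorial: it makes the columns of the left-hand side available by name, so neither the dot nor the curly brackets are needed.

```r
library(magrittr)

# Curly brackets: the dot stands for the data frame and nothing is
# inserted automatically as the first argument.
with_braces <- mtcars %>% {cor(.$mpg, .$wt)}

# Exposition pipe: the columns of mtcars can be referred to by name.
with_exposition <- mtcars %$% cor(mpg, wt)

# Plain base R for comparison.
base_r <- cor(mtcars$mpg, mtcars$wt)

all.equal(with_braces, base_r)
# [1] TRUE
all.equal(with_exposition, base_r)
# [1] TRUE
```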

In the code below, we want to split the data frame by continent, so that we get a list of data frames.

gapminder %>%
  split(.$continent)

# the code above is equivalent to 
gapminder %>%
  split(., .$continent)

# the code above is equivalent to
split(gapminder, gapminder$continent)

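Here, we want the left-hand side to be the first argument of the right-hand side, and therefore we do not need the curly brackets. The dot in .$continent is nested inside another call, so the pipe still inserts the data frame as the first argument.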

Some Simple Placeholder Examples

To get more familiar with how the placeholder works, here are some short examples.

gapminder %>%
  dplyr::pull(continent) %>%
  gsub("Europe", "EUROPE", .) %>%
  head(20)

# [1] "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"  
# [9] "Asia"   "Asia"   "Asia"   "Asia"   "EUROPE" "EUROPE" "EUROPE" "EUROPE"
# [17] "EUROPE" "EUROPE" "EUROPE" "EUROPE"

The gsub() function does not take the vector to be searched as its first argument. Its signature is gsub(pattern, replacement, x), so we have to place the placeholder in the third position, where gsub() expects the vector, in our case dplyr::pull(continent).
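To see what happens when the placeholder is left out, here is a small sketch with a made-up string. Without the dot, the piped value ends up as the pattern instead of the text to be searched.

```r
library(magrittr)

x <- "hello world"

# Placeholder in the third position: gsub("o", "0", x).
correct <- x %>% gsub("o", "0", .)

# No placeholder: the pipe builds gsub(x, "o", "0"), so x is used as the
# pattern and "0" is the string being searched. Nothing matches.
wrong <- x %>% gsub("o", "0")

correct
# [1] "hell0 w0rld"

wrong
# [1] "0"
```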

1:5 %>%
  paste(., letters[.])

# [1] "1 a" "2 b" "3 c" "4 d" "5 e"

Instead of pulling a single column out of a data frame with the pull() function, we could have also done it with the code below.

gapminder %>%
  .[["gdpPercap"]] %>%
  head()

# The code above is similar to the line below, which returns
# a one-column tibble instead of a plain vector
gapminder[c(1:6), "gdpPercap"]
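The bracket extraction above can be checked on the built-in mtcars data set, used here only for illustration. The sketch also shows extract2(), an alias that magrittr provides for the double-bracket operator, which some people find easier to read in a pipeline.

```r
library(magrittr)

# Placeholder with double brackets.
with_dot <- mtcars %>%
  .[["mpg"]] %>%
  head()

# magrittr's alias for `[[`.
with_alias <- mtcars %>%
  extract2("mpg") %>%
  head()

# Plain base R for comparison.
base_r <- head(mtcars$mpg)

identical(with_dot, base_r)
# [1] TRUE
identical(with_alias, base_r)
# [1] TRUE
```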

Another useful example is magrittr’s placeholder in combination with the ifelse() function and mutate().

Let’s suppose we wanted to relabel the Americas as the Western Hemisphere.

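# Convert factor columns to character so that ifelse() keeps the
# continent names instead of the underlying factor codes
gapminder <- gapminder %>%
  dplyr::mutate_if(is.factor, as.character)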

gapminder %>%
  dplyr::mutate(continent = ifelse(.$continent == "Americas", "Western Hemisphere", .$continent))

What we did above was test whether the continent is Americas and, if it is, change it to Western Hemisphere. If it is not Americas, we leave it as it was with .$continent.
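On ungrouped data, writing .$continent and writing continent inside mutate() give the same result. The sketch below checks this on a small made-up data frame (df is hypothetical and only stands in for gapminder); it assumes dplyr is installed.

```r
library(magrittr)

# Hypothetical stand-in for the gapminder data.
df <- data.frame(
  continent = c("Americas", "Asia", "Europe"),
  stringsAsFactors = FALSE
)

# Column accessed through the placeholder.
with_dot <- df %>%
  dplyr::mutate(continent = ifelse(.$continent == "Americas",
                                   "Western Hemisphere", .$continent))

# Column accessed directly by name.
without_dot <- df %>%
  dplyr::mutate(continent = ifelse(continent == "Americas",
                                   "Western Hemisphere", continent))

identical(with_dot, without_dot)
# [1] TRUE
```

The two forms stop being interchangeable once the data frame is grouped.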

Dollar Sign ($), dplyr’s group_by(), the Pipe and Placeholders

There is one last thing we need to talk about: using the pipe operator and the placeholder in combination with dplyr’s group_by() function.

Let’s suppose we wanted to get, for each continent and year, the number of countries as a proportion of all rows in the data set. Intuitively we would do that with the code below.

gapminder %>%
  dplyr::group_by(., continent, year) %>%
  dplyr::summarise(., count = dplyr::n()) %>%
  dplyr::mutate(., total = sum(count),
                prop = count / total) %>%
  head()

# continent  year  count  total  prop
# Africa     1952  52     624    0.08333333
# Africa     1957  52     624    0.08333333
# Africa     1962  52     624    0.08333333
# Africa     1967  52     624    0.08333333
# Africa     1972  52     624    0.08333333
# Africa     1977  52     624    0.08333333

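However, this does not give us the desired result. After summarise(), the data frame is still grouped by continent, so sum(count) is computed within each continent. The total column therefore holds the number of rows for Africa from 1952 until 2007 (52 countries times 12 survey years = 624). What we wanted instead is to divide the count by the total number of rows across all continents from 1952 until 2007.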

gapminder %>%
  dplyr::group_by(., continent, year) %>%
  dplyr::summarise(., count = dplyr::n()) %>%
  dplyr::mutate(., total = sum(.$count),
                prop = count / total) %>%
  head()

# continent  year  count  total  prop
# Africa     1952  52     1704   0.03051643
# Africa     1957  52     1704   0.03051643
# Africa     1962  52     1704   0.03051643
# Africa     1967  52     1704   0.03051643
# Africa     1972  52     1704   0.03051643
# Africa     1977  52     1704   0.03051643

Writing .$count instead of count makes sum() work on the whole count column of the data frame that was piped into mutate(), ignoring the grouping, so the total is 1704, the number of rows in the original gapminder data frame. Keep this example in mind whenever you are grouping by some variables and want to divide the counts by the overall total (the number of rows of the original data frame) instead of the total of each individual group created by dplyr.
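The difference between count and .$count on grouped data can be reproduced with a few made-up rows. This is a sketch that assumes dplyr is installed; df and its column names are hypothetical. It also shows an alternative that avoids the placeholder by calling ungroup() before computing the total.

```r
library(magrittr)

# Hypothetical data: two groups with made-up counts.
df <- data.frame(
  group = c("a", "a", "b"),
  count = c(1, 2, 4)
)

result <- df %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(
    group_total   = sum(count),   # computed within each group
    overall_total = sum(.$count)  # computed on the whole piped data frame
  ) %>%
  dplyr::ungroup()

result$group_total
# [1] 3 3 4

result$overall_total
# [1] 7 7 7

# Alternative without the placeholder: drop the grouping first.
alternative <- df %>%
  dplyr::group_by(group) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(overall_total = sum(count))

identical(alternative$overall_total, result$overall_total)
# [1] TRUE
```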

It is not easy to get your head around magrittr’s pipe when you first see it. Another challenge is knowing exactly where to place placeholders. This also requires a good understanding of the functions you are using in combination with the pipe. Most tidyverse functions work really well with magrittr’s pipe because they take a data frame as their first argument. Therefore, there is rarely any need to think about placeholders.

I hope you have enjoyed this quick introduction to magrittr’s pipe operator. If you have any questions or suggestions, please let me know in the comments below.

Tags: data manipulation magrittr pipe R tidyverse

Comments (3)

Diem Nguyen says:

July 19, 2019 at 9:16 am

Hi Pascal, thanks a lot for thorough explanation abt dplyr and pipe. i’ve been using dplyr for 10 months but never know the underlying mechanisms behind the code until now. Your sharing is really amazing. Thanks!

1. Pascal Schmidt says:
  
  July 19, 2019 at 1:37 pm
  
  Hi Diem,
  
  Thank you for your comment. For the longest time I also just used the pipe without knowing how it works and even thought it is only a part of the dplyr package.
  
  It was not until I read magrittr’s documentation thoroughly that I understood the power of the pipe.
  
  Keep up the work!
  
Stewart Li says:

February 19, 2021 at 1:14 am

Thank you very much for your posts. They are really helpful.