7 Tidyverse Tricks for Getting Your Data Into the Right Shape

In my opinion, the tidyverse is the best collection of packages in R for data cleaning and data munging. Because its packages share an opinionated design, working with the tidyverse becomes very intuitive after some time. Knowing all of its ins and outs is almost impossible, though. Therefore, I am going to share some tips and tricks I have learned recently that make your code more readable and help with your data manipulations.

Let’s jump into it and see the 7 tidyverse tricks for getting your data into the right shape.

Trick 1: Use count Instead of group_by + summarize(n = n())

Let us use the starwars data set in the dplyr package.

head(starwars)

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
## 1 Luke~    172    77 blond      fair       blue            19   male   Tatooine 
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine 
## 3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>   Naboo    
## 4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine 
## 5 Leia~    150    49 brown      light      brown           19   female Alderaan 
## 6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine 
## # ... with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Now, if we want to count the number of Star Wars characters with the same hair, skin, and eye color, we can do:

starwars %>%
  dplyr::group_by(hair_color, skin_color, eye_color) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n)) 

## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows

In order to shorten the code, we can do:

starwars %>%
  dplyr::count(hair_color, skin_color, eye_color, sort = TRUE)

## # A tibble: 67 x 4
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows

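What took us 3 lines of code can be done in 1 line. As the missing "Groups" line in the second output shows, count() also returns an ungrouped tibble, whereas the group_by() + summarise() version is still grouped by hair_color and skin_color.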
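To double-check that the two versions agree, here is a minimal sketch. The object names long_form and short_form are mine, and .groups = "drop" assumes dplyr 1.0.0 or later; it removes the leftover grouping so that both results are plain tibbles.

```r
library(dplyr)

# Long form: group, count, then drop the leftover grouping
long_form <- starwars %>%
  group_by(hair_color, skin_color, eye_color) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Short form: count() groups, counts, and sorts in one call;
# for ungrouped input it returns an ungrouped result
short_form <- starwars %>%
  count(hair_color, skin_color, eye_color, sort = TRUE)

# Both versions should have one row per color combination,
# and the counts should add up to the number of characters
nrow(long_form) == nrow(short_form)
sum(short_form$n) == nrow(starwars)
```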

Trick 2: Use purrr::when() for Control Flow Instead of if Statements

Another way to write if statements in R is with the when() function from the purrr package. The advantage is that we can easily integrate it with the pipe %>% and do not have to break the flow of our code. Keep in mind that when() was deprecated in purrr 1.0.0, so treat this trick as one for older code bases.

column_of_interest <- "vehicles"
starwars %>%
              # if statement 
  purrr::when(column_of_interest == "films" ~ dplyr::select(., dplyr::all_of(column_of_interest)) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),

              # else if statement 
              column_of_interest == "vehicles" ~ dplyr::select(., dplyr::all_of(column_of_interest)) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),

              # else if statement 
              column_of_interest == "starships" ~ dplyr::select(., dplyr::all_of(column_of_interest)) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),

              # else statement
              ~ .)

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 0
## 
## [[4]]
## [1] 0
## 
## [[5]]
## [1] 1
## 
## [[6]]
## [1] 0
## ...

The code above counts the number of films, vehicles, or starships for each character in Star Wars, depending on the value of column_of_interest. If column_of_interest does not match any of the 3 columns, the data frame itself is returned unchanged. We can easily continue with the pipe, and the flow of the code is not interrupted by annoying if, else if, and else statements.
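For readers on purrr 1.0.0 or later, where when() is deprecated, here is a sketch of the same logic with an ordinary if inside a small function. The function name count_list_items is hypothetical, and lengths() is the base R shortcut for applying length() to every element of a list; note that it returns an integer vector rather than a list.

```r
library(dplyr)

# Hypothetical helper: returns per-character item counts for a list column,
# or the unchanged data frame if the column is not one of the three
count_list_items <- function(data, column_of_interest) {
  list_columns <- c("films", "vehicles", "starships")
  if (column_of_interest %in% list_columns) {
    lengths(data[[column_of_interest]])
  } else {
    data
  }
}

# The helper still slots into a pipe
starwars %>%
  count_list_items("vehicles") %>%
  head()
```

Wrapping the branch in a function keeps the pipe readable and makes the fallback case explicit.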

Trick 3: Use tidyverse Variants Instead of Tidy Evaluation

Tidy evaluation is a blessing and a curse in R. As far as I am aware, R is the only language that does something like tidy evaluation. If you are unaware of what tidy evaluation is, it basically turns character strings into unquoted symbols so that they work in functions. Tidy evaluation is convenient when developing functions or writing Shiny applications where the input from the user changes. Let me give you an example. Let’s assume we want to group by one or more variables in the Star Wars data set. One way to do that is shown in the code below:

groups <- c("hair_color", "skin_color", "eye_color")

starwars %>%
  dplyr::group_by(!!!syms(groups)) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n))

## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows

What the code above basically does is turn the strings in the groups vector into symbols and unquote them, so they can be used in the code that follows (I know that is a very, very loose explanation). Tidy evaluation comes in handy when we want to create a function where groups is a function argument and the user can group the data frame by any column they want. It also comes in handy for Shiny applications, where we are forced to work solely with strings, for example input$your_variable.

However, we don’t always have to use tidy evaluation. A nice way out are the tidyverse variants such as mutate_at, group_by_at, and modify_if.

With these variants, the code above would look like this:

groups <- c("hair_color", "skin_color", "eye_color")

starwars %>%
  dplyr::group_by_at(vars(groups)) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n))

## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows

That’s a very easy way out and does not force you to learn tidy evaluation.
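The dplyr variants ending in _at, _if, and _all have been superseded by across() since dplyr 1.0.0 (they still work). As a sketch, assuming dplyr 1.0.0 or later, the same grouping by strings can be wrapped in a function where groups is an argument; count_by is a hypothetical name.

```r
library(dplyr)

# Hypothetical helper: count rows per combination of the columns named in `groups`
count_by <- function(data, groups) {
  data %>%
    # all_of() selects columns by name from a character vector
    group_by(across(all_of(groups))) %>%
    summarise(n = n(), .groups = "drop") %>%
    arrange(desc(n))
}

groups <- c("hair_color", "skin_color", "eye_color")
count_by(starwars, groups)
```

Because all_of() errors on column names that do not exist, a typo in groups fails loudly instead of silently grouping by nothing.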

Trick 4: Working with Lists in the tidyverse

Let’s say we want to know if there is a difference in height, mass, or birth year for characters with different eye colors. To find out, we can perform a Kruskal-Wallis test (a non-parametric alternative to the one-way ANOVA) for each of the three variables, with eye color as the grouping variable.

starwars %>%
  dplyr::summarise_if(is.numeric, 
                      ~ list(stats::kruskal.test(. ~ eye_color)))

## # A tibble: 1 x 3
##   height  mass    birth_year
##   <list>  <list>  <list>    
## 1 <htest> <htest> <htest>

The test gives us back a list (an object of class htest). In order to store it in a data frame, we wrap list() around the kruskal.test() call, which turns each result into the single element of a list column.
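The list-column idea also works outside of summarise_if(). The following is a sketch, not the code used above: it builds a tibble with one row per variable and stores each test in a list column via purrr::map(). The helper run_test and the column names variable and test are my own, and it calls kruskal.test() with separate x and g arguments instead of a formula.

```r
library(dplyr)
library(purrr)

numeric_vars <- c("height", "mass", "birth_year")

# Hypothetical helper: Kruskal-Wallis test of one column against eye color
run_test <- function(column) {
  stats::kruskal.test(x = starwars[[column]], g = factor(starwars$eye_color))
}

tests <- tibble(
  variable = numeric_vars,
  test     = map(numeric_vars, run_test)  # map() returns a list, so `test` is a list column
)

tests
```

The long layout (one row per variable) makes it easy to add further columns later, for example one with the extracted p-values.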

Trick 5: Again These Goddamn Lists… and `pluck`

Working with lists can be nasty. However, the purrr package does an excellent job of dealing with lists. To get the p-value out of all three columns, we can use the pluck function. Let’s first examine one of the lists.

starwars %>%
  dplyr::summarise_if(is.numeric, ~ list(stats::kruskal.test(. ~ eye_color))) -> p

# getting the p-value of the first variable
p$height[[1]]$p.value
## [1] 0.8063219

# getting all p-values with pluck and map_dfr
p %>% 
  purrr::map_dfr(~ purrr::pluck(., 1, "p.value"))
## # A tibble: 1 x 3
##   height  mass birth_year
##    <dbl> <dbl>      <dbl>
## 1  0.806 0.357      0.388

Excellent! We loop over the columns with map_dfr(), get the first list element with 1, and then the p-value with "p.value". The pluck function is excellent for deeply nested lists. By the way, none of the p-values is anywhere near the usual 0.05 threshold, so we have no evidence that height, mass, or birth year differ between eye color categories.
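To see how pluck() walks through a nested list, here is a small self-contained sketch. The list nested and the numbers in it are made up for illustration and are not results from the Star Wars data.

```r
library(purrr)

# Made-up nested list that mimics the shape of the list columns above
nested <- list(
  height = list(list(statistic = 1.5, p.value = 0.8)),
  mass   = list(list(statistic = 2.0, p.value = 0.3))
)

# Name, then position, then name: same as nested$height[[1]]$p.value
pluck(nested, "height", 1, "p.value")

# A missing element gives the .default value instead of an error
pluck(nested, "mass", 1, "not_there", .default = NA_real_)

# Extract the same element from every entry as a named numeric vector
map_dbl(nested, ~ pluck(.x, 1, "p.value"))
```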

Trick 6: Magrittr’s pipe %>%

Yes, of course, we all know the pipe. However, the pipe itself is not what I primarily want to discuss, but rather these curly braces {}. In my opinion, the pipe gets too little attention when the tidyverse is taught. During my time at university and when I tutor students, the lecture notes use the pipe, but the only comment on it is that it chains your code together. There is never any discussion of how it works, which I find very odd. If you want to learn about the pipe and the dot placeholder in more depth, check out my blog post about it.

So let’s discuss these curly braces {}… If we want to see what the correlation between height and mass is, we can use the pipe like this:

starwars %>%
  { cor(.$height, .$mass, use = "pairwise.complete.obs") }

## [1] 0.1338842

If we did not use the curly braces, the pipe would pass whatever came before it as the first argument to the function that comes after it. So in our case, without the braces, our code would read like this: cor(., .$height, .$mass, use = "pairwise.complete.obs"), where the dot placeholder . is whatever came before the pipe (the Star Wars data frame in our case).
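As a small sketch, the braces version can be compared with two alternatives that get around the first-argument rule in other ways: base R's with() and magrittr's exposition pipe %$%. These two are my own additions for comparison, and the object names are arbitrary.

```r
library(dplyr)     # for the starwars data
library(magrittr)  # for %>% and %$%

# Braces: the data frame is NOT inserted as the first argument of cor()
with_braces <- starwars %>%
  { cor(.$height, .$mass, use = "pairwise.complete.obs") }

# with(): the data frame becomes the first argument of with(),
# and the columns are looked up by name inside it
with_with <- starwars %>%
  with(cor(height, mass, use = "pairwise.complete.obs"))

# %$%: exposes the columns of the left-hand side to the expression on the right
with_expose <- starwars %$%
  cor(height, mass, use = "pairwise.complete.obs")

# All three should give the same correlation
all.equal(with_braces, with_with)
all.equal(with_braces, with_expose)
```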

Trick 7: Rename variables inside the select Function

When we want to rename variables, there are a couple of ways to do that. One tidyverse way is to use the rename function from the dplyr package.

starwars %>%
  dplyr::rename(Character = name)

## # A tibble: 87 x 13
##    Character height  mass hair_color skin_color eye_color birth_year gender
##    <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke Sky~    172    77 blond      fair       blue            19   male  
##  2 C-3PO        167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2         96    32 <NA>       white, bl~ red             33   <NA>  
##  4 Darth Va~    202   136 none       white      yellow          41.9 male  
##  5 Leia Org~    150    49 brown      light      brown           19   female
##  6 Owen Lars    178   120 brown, gr~ light      blue            52   male  
##  7 Beru Whi~    165    75 brown      light      blue            47   female
##  8 R5-D4         97    32 <NA>       white, red red             NA   <NA>  
##  9 Biggs Da~    183    84 black      light      brown           24   male  
## 10 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

or…

starwars %>%
  dplyr::select(Character = name, dplyr::everything())

## # A tibble: 87 x 13
##    Character height  mass hair_color skin_color eye_color birth_year gender
##    <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke Sky~    172    77 blond      fair       blue            19   male  
##  2 C-3PO        167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2         96    32 <NA>       white, bl~ red             33   <NA>  
##  4 Darth Va~    202   136 none       white      yellow          41.9 male  
##  5 Leia Org~    150    49 brown      light      brown           19   female
##  6 Owen Lars    178   120 brown, gr~ light      blue            52   male  
##  7 Beru Whi~    165    75 brown      light      blue            47   female
##  8 R5-D4         97    32 <NA>       white, red red             NA   <NA>  
##  9 Biggs Da~    183    84 black      light      brown           24   male  
## 10 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray       57   male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

The first version is probably the better choice. However, if you are selecting variables anyway and want to rename some in the process, the second version is for you.
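For instance, here is a sketch of selecting three columns and renaming two of them in the same call; the new names Character and Height are arbitrary choices.

```r
library(dplyr)

# Keep only three columns, renaming two of them on the way
renamed <- starwars %>%
  select(Character = name, Height = height, homeworld)

names(renamed)
```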

These were the 7 tidyverse tricks that get your data into the right shape.

I hope you have enjoyed this blog post. If you know of other handy tidyverse tricks, let me know in the comments below.
