7 Tidyverse Tricks for Getting Your Data Into the Right Shape
April 28, 2020 By Pascal Schmidt R Tidyverse Tutorial
The tidyverse is the best package in R for data cleaning and data munging, in my opinion. Because it is an opinionated collection of packages, using the tidyverse becomes very intuitive after you have worked with it for some time. Knowing all the ins and outs of the tidyverse is almost impossible. Therefore, I am going to share some tips and tricks I have learned recently that make your code more readable and help with your data manipulations.
Let’s jump into it and see the 7 tidyverse tricks for getting your data into the right shape.
Trick 1: Use count Instead of group_by + summarize(n = n())
Let us use the starwars data set from the dplyr package.
head(starwars)
## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Luke~    172    77 blond      fair       blue            19   male   Tatooine
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine
## 3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>   Naboo
## 4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine
## 5 Leia~    150    49 brown      light      brown           19   female Alderaan
## 6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine
## # ... with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>
Now, if we want to count the number of Star Wars characters that share the same hair, skin, and eye color, we can do:
starwars %>%
  dplyr::group_by(hair_color, skin_color, eye_color) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n))
## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows
In order to shorten the code, we can do:
starwars %>%
  dplyr::count(hair_color, skin_color, eye_color, sort = TRUE)
## # A tibble: 67 x 4
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows
What took us 3 lines of code can be done in 1 line.
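To convince yourself that the two versions are interchangeable, here is a small sketch on made-up data. The pets tibble and its columns are hypothetical and only serve as an illustration. The long version ends with ungroup() because summarise() may leave some grouping behind, whereas count() returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
pets <- tibble(
  species = c("cat", "dog", "cat", "cat", "dog", "bird"),
  colour  = c("black", "black", "white", "black", "brown", "green")
)

# The long way: group, summarise, drop the leftover grouping, sort
long_way <- pets %>%
  group_by(species, colour) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))

# The short way: count() groups, tallies, ungroups, and sorts in one call
short_way <- pets %>%
  count(species, colour, sort = TRUE)

# Compare the two results
all.equal(as.data.frame(long_way), as.data.frame(short_way))
```

count() also accepts a name argument if you would rather call the new column something other than n.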
Trick 2: Use purrr::when() for Control Flow Instead of if Statements
Another way to write if statements in R is with the when() function from the purrr package. Its advantage is that it integrates easily with the pipe %>%, so we do not have to break the flow of our code.
column_of_interest <- "vehicles"

starwars %>%
  # if statement
  purrr::when(column_of_interest == "films" ~ dplyr::select(., column_of_interest) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),
              # else if statement
              column_of_interest == "vehicles" ~ dplyr::select(., column_of_interest) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),
              # else if statement
              column_of_interest == "starships" ~ dplyr::select(., column_of_interest) %>%
                dplyr::pull() %>%
                purrr::map(~ length(.)),
              # else statement
              ~ .)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 0
##
## [[4]]
## [1] 0
##
## [[5]]
## [1] 1
##
## [[6]]
## [1] 0
. . . .
So the code above counts the number of films, vehicles, or starships for each character in Star Wars, depending on the value of column_of_interest. If column_of_interest does not match any of the 3 columns, the data frame itself is returned unchanged. We can easily continue with the pipe, and the flow of the code is not destroyed by annoying if, else if, and else statements.
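As far as I know, more recent purrr releases (1.0.0 onwards) mark when() as deprecated and recommend a plain if instead. A minimal sketch of the same idea that still keeps the pipe flowing is to put the if/else inside curly braces. The names list_columns and result are my own, and I use map_int() so that the counts come back as an integer vector rather than a list.

```r
library(dplyr)
library(purrr)

column_of_interest <- "vehicles"

# Hypothetical helper vector: the list columns we know how to count
list_columns <- c("films", "vehicles", "starships")

result <- starwars %>%
  {
    if (column_of_interest %in% list_columns) {
      # one integer per character: how many entries the list column holds
      map_int(.[[column_of_interest]], length)
    } else {
      # no match: hand the data frame on unchanged
      .
    }
  }

head(result)
```

Because the braces return a value like any other pipe step, you can keep piping after the closing brace.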
Trick 3: Use tidyverse Variants Instead of Tidy Evaluation
Tidy evaluation is a blessing and a curse in R. As far as I am aware, R is the only language that does something like tidy evaluation. If you are unaware of what tidy evaluation is, it is basically unquoting character strings so that they work in functions. Tidy evaluation is convenient when it comes to developing functions or writing Shiny applications where the input from the user changes. Let me give you an example. Let’s assume we want to group by one or more variables in the Star Wars data set. One way to do that is shown in the code below:
groups <- c("hair_color", "skin_color", "eye_color")

starwars %>%
  dplyr::group_by(!!!syms(groups)) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n))
## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows
What the code above basically does is turn the strings in the groups vector into column names, so they can be used in the code that follows (I know that is a very loose explanation). Tidy evaluation comes in handy when we want to create a function where groups is a function argument and the user can group the data frame by any column they want. It also comes in handy for Shiny applications, where we are forced to work solely with strings, for example input$your_variable.
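To make the function use case concrete, here is a minimal sketch. The helper name count_by is hypothetical, and I call syms() through rlang:: so the example does not depend on which packages are attached.

```r
library(dplyr)

# Hypothetical helper: `groups` is a character vector of column names
count_by <- function(data, groups) {
  data %>%
    # syms() turns the strings into symbols, !!! splices them into group_by()
    group_by(!!!rlang::syms(groups)) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    arrange(desc(n))
}

# The caller decides the grouping with plain strings
by_eye  <- count_by(starwars, "eye_color")
by_both <- count_by(starwars, c("hair_color", "eye_color"))
```

In a Shiny app, the second argument could simply be something like input$your_variable.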
However, we don’t always have to use tidy evaluation. A nice way out is to use the tidyverse variants such as mutate_at, group_by_at, and modify_if.
With these variants, the code above would look like this:
groups <- c("hair_color", "skin_color", "eye_color")

starwars %>%
  dplyr::group_by_at(vars(groups)) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(desc(n))
## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color skin_color eye_color     n
##    <chr>      <chr>      <chr>     <int>
##  1 brown      light      brown         6
##  2 brown      fair       blue          4
##  3 none       grey       black         4
##  4 black      dark       brown         3
##  5 blond      fair       blue          3
##  6 black      fair       brown         2
##  7 black      tan        brown         2
##  8 black      yellow     blue          2
##  9 brown      fair       brown         2
## 10 none       white      yellow        2
## # ... with 57 more rows
That’s a very easy way out and does not force you to learn tidy evaluation.
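If you are on dplyr 1.0.0 or later, the _at and _if variants are marked as superseded in favour of across(). A minimal sketch of the same grouping, assuming that version or newer (the object name by_across is my own):

```r
library(dplyr)

groups <- c("hair_color", "skin_color", "eye_color")

by_across <- starwars %>%
  # all_of() says explicitly that `groups` holds column names as strings
  group_by(across(all_of(groups))) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

by_across
```

Like the _at variant, this takes plain strings, so it works just as well with function arguments or Shiny inputs.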
Trick 4: Working with Lists in the tidyverse
Let’s say we want to know if there is a difference in height, mass, or birth year for characters with different eye colors. To find out, we can run a Kruskal-Wallis test (a rank-based alternative to a one-way ANOVA) on each of the three variables, with eye color as the grouping variable.
starwars %>%
  dplyr::summarise_if(is.numeric, ~ list(stats::kruskal.test(. ~ eye_color)))
## # A tibble: 1 x 3
##   height  mass    birth_year
##   <list>  <list>  <list>
## 1 <htest> <htest> <htest>
The test gives us back a list. In order to store that list in a data frame cell, we wrap the kruskal.test() call in list().
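The same wrapping idea works in a plain summarise() call. Below is a minimal sketch on the built-in mtcars data, where the columns mpg, hp, and cyl are chosen purely for illustration. I pass the values and the grouping variable as two separate arguments to kruskal.test() instead of using a formula.

```r
library(dplyr)

tests <- mtcars %>%
  summarise(
    # list() stores the whole htest object in a single cell
    mpg_by_cyl = list(stats::kruskal.test(mpg, cyl)),
    hp_by_cyl  = list(stats::kruskal.test(hp, cyl))
  )

# Each cell is a one-element list holding the full test result
class(tests$mpg_by_cyl[[1]])
tests$mpg_by_cyl[[1]]$p.value
```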
Trick 5: Again These Goddamn Lists… and pluck
Working with lists can be nasty. However, the purrr package does an excellent job of dealing with lists. To get the p-value out of all three columns, we can use the pluck function. Let’s first examine one of the lists.
starwars %>%
  dplyr::summarise_if(is.numeric, ~ list(stats::kruskal.test(. ~ eye_color))) -> p

# getting the p-value of the first variable
p$height[[1]]$p.value
## [1] 0.8063219

# getting all p-values with pluck and map_dfr
p %>%
  purrr::map_dfr(~ purrr::pluck(., 1, "p.value"))
## # A tibble: 1 x 3
##   height  mass birth_year
##    <dbl> <dbl>      <dbl>
## 1  0.806 0.357      0.388
Excellent! We loop over the columns with map_dfr(); for each column, pluck first takes list element 1 and then, from that element, the entry named "p.value". The pluck function is excellent for deeply nested lists. By the way, none of the three p-values is small, so these tests give no evidence that height, mass, or birth year differ between eye color categories.
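Here is a small sketch of pluck on a hand-made nested list. The list and its element names are hypothetical. It shows that names and positions can be mixed freely, and that the .default argument gives you a fallback when a level is missing.

```r
library(purrr)

# Hypothetical nested list, invented for this sketch
nested <- list(
  model = list(stats = list(p.value = 0.04, statistic = 4.2)),
  meta  = list(tags = c("a", "b"))
)

# Walk down three levels by name
p_value <- pluck(nested, "model", "stats", "p.value")

# The base R equivalent needs chained [[ ]]
same_value <- nested[["model"]][["stats"]][["p.value"]]

# Names and positions can be mixed: second element of the tags vector
second_tag <- pluck(nested, "meta", "tags", 2)

# A missing level gives the fallback instead of an error
fallback <- pluck(nested, "model", "oops", "p.value", .default = NA_real_)
```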
Trick 6: Magrittr’s pipe %>%
Yes, of course, we all know the pipe. However, the pipe itself is not what I primarily want to discuss. It is rather these curly braces {}. In my opinion, the pipe is under-explained when the tidyverse is taught. During my time in university and when I am tutoring students, the lecture notes use the pipe, but the only comment on it is that it chains your code together. There is never any discussion of how it works, which I find very odd. If you want to learn about the pipe and the dot placeholder in more depth, check out my blog post about it.
So let’s discuss these curly braces {}… If we want to see what the correlation between height and mass is, we can use the pipe like this:
starwars %>%
  {
    cor(.$height, .$mass, use = "pairwise.complete.obs")
  }
## [1] 0.1338842
If we did not use the curly braces, the pipe would pass whatever came before it as the first argument to the function that comes after it. So in our case, without the braces, our code would read like this: cor(., .$height, .$mass, use = "pairwise.complete.obs"), where the dot placeholder . is whatever came before the pipe (the Star Wars data frame in our case).
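A tiny sketch with a plain integer vector makes the difference visible. The object names are my own.

```r
library(magrittr)

# The dot appears only inside a nested call, so the left-hand side is ALSO
# inserted as the first argument: this runs c(1:3, sum(1:3))
without_braces <- 1:3 %>% c(sum(.))

# With braces, nothing is inserted: only c(sum(1:3)) is evaluated
with_braces <- 1:3 %>% { c(sum(.)) }

without_braces
with_braces
```

The first result has four elements (the original three plus their sum), while the second is just the sum.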
Trick 7: Rename variables inside the select Function
When we want to rename variables, there are a couple of ways to do that. One tidyverse way would be to use the rename function from the dplyr package.
starwars %>%
  dplyr::rename(Character = name)
## # A tibble: 87 x 13
##    Character height  mass hair_color skin_color eye_color birth_year gender
##    <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke Sky~    172    77 blond      fair       blue            19   male
##  2 C-3PO        167    75 <NA>       gold       yellow         112   <NA>
##  3 R2-D2         96    32 <NA>       white, bl~ red             33   <NA>
##  4 Darth Va~    202   136 none       white      yellow          41.9 male
##  5 Leia Org~    150    49 brown      light      brown           19   female
##  6 Owen Lars    178   120 brown, gr~ light      blue            52   male
##  7 Beru Whi~    165    75 brown      light      blue            47   female
##  8 R5-D4         97    32 <NA>       white, red red             NA   <NA>
##  9 Biggs Da~    183    84 black      light      brown           24   male
## 10 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray       57   male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
or…
starwars %>%
  dplyr::select(Character = name, dplyr::everything())
## # A tibble: 87 x 13
##    Character height  mass hair_color skin_color eye_color birth_year gender
##    <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke Sky~    172    77 blond      fair       blue            19   male
##  2 C-3PO        167    75 <NA>       gold       yellow         112   <NA>
##  3 R2-D2         96    32 <NA>       white, bl~ red             33   <NA>
##  4 Darth Va~    202   136 none       white      yellow          41.9 male
##  5 Leia Org~    150    49 brown      light      brown           19   female
##  6 Owen Lars    178   120 brown, gr~ light      blue            52   male
##  7 Beru Whi~    165    75 brown      light      blue            47   female
##  8 R5-D4         97    32 <NA>       white, red red             NA   <NA>
##  9 Biggs Da~    183    84 black      light      brown           24   male
## 10 Obi-Wan ~    182    77 auburn, w~ fair       blue-gray       57   male
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
The first version, with rename, is probably the better choice when renaming is all you want to do. However, if you are selecting variables anyway and want to rename some during the process, the code above is for you.
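One difference worth keeping in mind is that select only keeps the columns you list, and it puts them in the order you list them. A minimal sketch on a hypothetical three-column tibble (all names here are made up):

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
df <- tibble(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

# rename(): all columns are kept and stay where they were
renamed <- df %>% rename(id = a)

# select(): only the listed columns survive, renamed on the way
selected <- df %>% select(id = a, c)

# select() with everything(): the renamed column moves to the front
moved <- df %>% select(flag = c, everything())

names(renamed)
names(selected)
names(moved)
```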
These were the 7 tidyverse tricks that get your data into the right shape.
I hope you have enjoyed this blog post. If you know of some handy tidyverse tricks, let me know in the comments below.