Tidyr’s pivot_longer() and pivot_wider() Examples From the #TidyTuesday Challenge
March 28, 2020 By Pascal Schmidt #TidyTuesday R
In today’s blog post, we will be talking about tidyr
s pivot_wider()
and pivot_longer()
functions in the context of the brain injuries #TidyTuesday challenge.
The two functions above have been developed to make it easier for R users/programmers to have a better mental model about the transformations on the data frames. Unlike gather and spread, pivot_longer()
and pivot_wider()
are the opposites of each other.
Let’s see a quick example and then dive into the #TidyTuesday’s brain injuries example.
Tidyr’s vignette about pivot_longer()
and pivot_wider()
can be found here.
What we will be covering:
- An example of how
pivot_longer()
works. - An example of how
pivot_wider()
works from this week’s #TidyTuesday data set.
First, let’s start with a toy data set that illustrates these concepts.
pivot_wider Example
library(tidyverse) year <- paste0(2001:2020) team <- letters[1:length(year)] num <- sample(c(1:100), size = length(year)^2, replace = TRUE) df <- data.frame(matrix(num, nrow = length(year), ncol = length(year))) %>% purrr::set_names(year) %>% dplyr::bind_cols(data.frame(category = cat)) %>% dplyr::as_tibble() ## # A tibble: 20 x 21 ##2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 31 26 7 41 52 79 24 83 88 77 7 ## 2 79 7 53 47 22 85 87 89 54 46 58 ## 3 51 42 27 90 89 37 7 39 75 70 61 ## 4 14 9 96 60 32 8 93 54 48 72 74 ## 5 67 83 38 95 25 51 79 90 20 44 24 ## 6 42 36 89 16 87 74 23 9 67 32 63 ## 7 50 78 34 94 35 50 26 71 93 36 54 ## 8 43 81 93 6 40 98 32 98 36 45 23 ## 9 14 43 69 72 30 74 7 48 52 14 26 ## 10 25 76 72 86 12 86 27 77 22 16 33 ## 11 90 15 76 86 31 76 42 83 49 87 57 ## 12 91 32 63 39 30 84 5 56 42 33 29 ## 13 69 7 13 31 64 46 70 39 59 40 10 ## 14 91 9 82 81 99 17 16 68 84 40 53 ## 15 57 41 97 50 14 62 24 1 11 10 54 ## 16 92 74 91 34 93 46 32 40 55 89 77 ## 17 9 23 25 4 96 54 21 30 8 72 11 ## 18 93 27 38 13 71 35 55 94 46 82 25 ## 19 99 60 21 69 67 94 75 89 85 9 52 ## 20 72 53 79 25 23 79 36 16 66 7 26 ## # ... with 10 more variables:2012
<int>,2013
<int>,2014
<int>, ## #2015
<int>,2016
<int>,2017
<int>,2018
<int>,2019
<int>, ## #2020
<int>, category <fct>
Above, we have a data frame with teams from a to t and their corresponding points scored in the league in the years from 2001 to 2020. In order to make this data tidy, we are going to use the pivot_longer()
function from the tidyr
package, to “lengthen” the data frame.
Below, the first argument is the data frame and the second one is the columns we want to put into a longer format. We do not have to write these column names in quotation marks because they already exist in the data frame. However, we have to put backticks around them because they are numbers and hence, no valid columns in a data frame.
Next, we want to say that all the columns specified, (2001 – 2020) will be put in a column called year
. This time we have to put it in quotation marks because the column does not exist yet.
Next, we put all the data stored in the columns 2001 – 2020 into another column called points. The resulting data frame is the one shown below.
pivot_longer(df, 2001:2020, names_to = "year", values_to = "points") ## # A tibble: 400 x 3 ## category year points ## <fct> <chr> <int> ## 1 a 2001 31 ## 2 a 2002 26 ## 3 a 2003 7 ## 4 a 2004 41 ## 5 a 2005 52 ## 6 a 2006 79 ## 7 a 2007 24 ## 8 a 2008 83 ## 9 a 2009 88 ## 10 a 2010 77 ## # ... with 390 more rows
pivot_longer()’s Advantage Over gather()
With the pivot_longer()
function, we can also specify multiple names for the names_to
argument. For example:
df <- structure(list(athlete = structure(c(5L, 4L, 1L, 3L, 6L, 2L), .Label = c("Billy Bob", "Harry Hat", "Jane Doe", "Joe Bloggs", "John Smith", "Sam Smith" ), class = "factor"), T1.HRmin = c(108L, 139L, 134L, 124L, 124L, 116L), T1.HRmax = c(136L, 159L, 160L, 146L, 154L, 148L), T2.HRmin = c(137L, 160L, 161L, 147L, 155L, 149L), T2.HRmax = c(149L, 175L, 178L, 163L, 164L, 165L), T3.HRmin = c(150L, 176L, 179L, 164L, 165L, 166L), T3.HRmax = c(161L, 189L, 194L, 172L, 180L, 177L), T4.HRmin = c(162L, 190L, 195L, 173L, 181L, 178L), T4.HRmax = c(166L, 194L, 200L, 177L, 186L, 182L), T5.HRmin = c(167L, 195L, 201L, 178L, 187L, 183L), T5.HRmax = c(220L, 220L, 220L, 200L, 201L, 220L)), class = "data.frame", row.names = c(NA, -6L)) ## athlete T1.HRmin T1.HRmax T2.HRmin T2.HRmax T3.HRmin T3.HRmax T4.HRmin ## 1 John Smith 108 136 137 149 150 161 162 ## 2 Joe Bloggs 139 159 160 175 176 189 190 ## 3 Billy Bob 134 160 161 178 179 194 195 ## 4 Jane Doe 124 146 147 163 164 172 173 ## 5 Sam Smith 124 154 155 164 165 180 181 ## 6 Harry Hat 116 148 149 165 166 177 178 ## T4.HRmax T5.HRmin T5.HRmax ## 1 166 167 220 ## 2 194 195 220 ## 3 200 201 220 ## 4 177 178 200 ## 5 186 187 201 ## 6 182 183 220
We can see that the columns consist of multiple heart rate zones and HRmin
and HRmax
. In the names_to
argument we can create multiple columns by creating a character vector and specifying the names_sep
argument.
df %>% tidyr::pivot_longer(cols = starts_with('T'), names_to = c('col', '.value'), names_sep = '\\.') ## # A tibble: 30 x 4 ## athlete col HRmin HRmax ## <fct> <chr> <int> <int> ## 1 John Smith T1 108 136 ## 2 John Smith T2 137 149 ## 3 John Smith T3 150 161 ## 4 John Smith T4 162 166 ## 5 John Smith T5 167 220 ## 6 Joe Bloggs T1 139 159 ## 7 Joe Bloggs T2 160 175 ## 8 Joe Bloggs T3 176 189 ## 9 Joe Bloggs T4 190 194 ## 10 Joe Bloggs T5 195 220 ## # ... with 20 more rows
Because we provided the special character .value
sentinel in the names_to
argument, we do not have to specify a values_to
argument and thevalue
column will be derived from the existing column (HRmin
, HRmax
). In order to do that, we have to specify the names_sep
argument in the function and specify how we want to split up the column names.
The example above can be found on StackOverflow.
pivot_wider and tidytuesday Challenge
Now onto the tidytuesday challenge. The data sets can be found on its GitHub website or you can read in the data with the URL below.
tbi_age <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-24/tbi_age.csv')
My goal was to visualize the cause of brain injury by age group. However, the way the age_group
column was structured annoyed me.
tbi_age %>% dplyr::group_by(age_group) %>% dplyr::summarise(sum_num = sum(number_est, na.rm = TRUE)) ## # A tibble: 11 x 2 ## age_group sum_num ## <chr> <dbl> ## 1 0-17 836718 ## 2 0-4 331403 ## 3 15-24 476652 ## 4 25-34 319303 ## 5 35-44 240462 ## 6 45-54 265417 ## 7 5-14 348913 ## 8 55-64 241511 ## 9 65-74 211035 ## 10 75+ 442377 ## 11 Total 2877539
We can see that there is some overlap between age groups. For example, a lot of numbers in the age group from 0-17 are included in the age group 0-4. Moreover, I did not like the grouping of age groups. Hence, I decided to use pivot_wider()
to group the column into these age groups: 0-4, 0-17, 18-24…
The code below shows how I did that:
tbi_age %>% dplyr::select(type, injury_mechanism, age_group, number_est) %>% tidyr::pivot_wider(names_from = age_group, values_from = number_est) ## # A tibble: 21 x 13 ## type injury_mechanism 0-17 0-45-14
15-24 25-34 35-44 45-54 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Emer~ Motor Vehicle C~ 47138 5464 19785 103892 71641 44108 40020 ## 2 Emer~ Unintentional F~ 397190 230776 133084 96568 70210 68830 95127 ## 3 Emer~ Unintentionally~ 229236 53436 120839 106679 44404 32479 30495 ## 4 Emer~ Other unintenti~ 55785 12007 30656 37118 22360 17541 17808 ## 5 Emer~ Intentional sel~ NA NA NA 870 650 421 247 ## 6 Emer~ Assault 24360 674 9690 65399 57213 34100 27682 ## 7 Emer~ Other or no mec~ 57983 19360 26022 33395 20974 16503 15962 ## 8 Hosp~ Motor Vehicle C~ 5830 870 2395 12925 11050 7305 8490 ## 9 Hosp~ Unintentional F~ 7935 4700 2270 3910 4470 5640 12010 ## 10 Hosp~ Unintentionally~ 1985 510 980 1070 635 610 685 ## # ... with 11 more rows, and 4 more variables:55-64
<dbl>,65-74
<dbl>, ## #75+
<dbl>, Total <dbl>
We take the names from the age_group column and then make these categories to column names. Then we take the values from the number_est
column and put these values in the appropriate age group columns.
Now it is easy for me to use mutate()
and construct my new categories.
tbi_age %>% dplyr::select(type, injury_mechanism, age_group, number_est) %>% tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>% dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14, 5-17 = 0-17 - 0-4, 18-24 = 15-24 - 15-17) %>% dplyr::select(-c(0-17,5-14, 15-24, 15-17)) ## # A tibble: 21 x 12 ## type injury_mechanism 0-4 25-34 35-44 45-54 55-64 65-74 75+ ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Emer~ Motor Vehicle C~ 5464 71641 44108 40020 27193 13829 8176 ## 2 Emer~ Unintentional F~ 230776 70210 68830 95127 112460 120327 286031 ## 3 Emer~ Unintentionally~ 53436 44404 32479 30495 20408 11937 13270 ## 4 Emer~ Other unintenti~ 12007 22360 17541 17808 12928 7077 7440 ## 5 Emer~ Intentional sel~ NA 650 421 247 105 NA NA ## 6 Emer~ Assault 674 57213 34100 27682 11538 2893 1260 ## 7 Emer~ Other or no mec~ 19360 20974 16503 15962 13387 10051 17318 ## 8 Hosp~ Motor Vehicle C~ 870 11050 7305 8490 7280 4485 3965 ## 9 Hosp~ Unintentional F~ 4700 4470 5640 12010 18490 25235 74005 ## 10 Hosp~ Unintentionally~ 510 635 610 685 765 790 1045 ## # ... with 11 more rows, and 3 more variables: Total <dbl>, 5-17 <dbl>, ## # 18-24 <dbl>
This is how the new table looks like. Now, as a sanity check, let’s sum up all the age groups and see if it going to match the total number.
tbi_age %>% dplyr::select(type, injury_mechanism, age_group, number_est) %>% tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>% dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14, 5-17 = 0-17 - 0-4, 18-24 = 15-24 - 15-17) %>% dplyr::select(-c(0-17,5-14, 15-24, 15-17)) %>% dplyr::mutate(Total = as.character(Total)) %>% dplyr::mutate(Total_check = purrr::pmap_dbl(dplyr::select_if(., is.numeric), sum, na.rm = TRUE)) %>% dplyr::select(Total, Total_check) ## # A tibble: 21 x 2 ## Total Total_check ## <chr> <dbl> ## 1 334109 334108 ## 2 1213412 1213413 ## 3 433947 433947 ## 4 164935 164935 ## 5 2567 1423 ## 6 210450 210449 ## 7 172974 172972 ## 8 58765 58765 ## 9 150730 150730 ## 10 7090 7090 ## # ... with 11 more rows
In row 5 where the numbers do not match up at all, this is because we have NA values, this is because we have NA
values for one of the injury categories. Other than that, everything is matching up just fine!
Now, I already mentioned that pivot_wider()
and pivot_longer()
are opposites. Hence, we can easily use pivot_longer()
from tidyr
to get the data back into its original shape again.
From pivot_wider() to pivot_longer()
tbi_age %>% dplyr::select(type, injury_mechanism, age_group, number_est) %>% tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>% dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14, 5-17 = 0-17 - 0-4, 18-24 = 15-24 - 15-17) %>% dplyr::select(-c(0-17,5-14, 15-24, 15-17)) %>% dplyr::mutate(Total = as.character(Total)) %>% dplyr::mutate(Total_check = purrr::pmap_dbl(dplyr::select_if(., is.numeric), sum, na.rm = TRUE)) %>% dplyr::select(-Total_check) %>% # convert back to double so all values in one column are of type double dplyr::mutate(Total = as.double(Total)) %>% tidyr::pivot_longer(0-4:18-24, names_to = "age_group", values_to = "num_est") ## # A tibble: 210 x 4 ## type injury_mechanism age_group num_est ## <chr> <chr> <chr> <dbl> ## 1 Emergency Department Visit Motor Vehicle Crashes 0-4 5464 ## 2 Emergency Department Visit Motor Vehicle Crashes 25-34 71641 ## 3 Emergency Department Visit Motor Vehicle Crashes 35-44 44108 ## 4 Emergency Department Visit Motor Vehicle Crashes 45-54 40020 ## 5 Emergency Department Visit Motor Vehicle Crashes 55-64 27193 ## 6 Emergency Department Visit Motor Vehicle Crashes 65-74 13829 ## 7 Emergency Department Visit Motor Vehicle Crashes 75+ 8176 ## 8 Emergency Department Visit Motor Vehicle Crashes Total 334109 ## 9 Emergency Department Visit Motor Vehicle Crashes 5-17 41674 ## 10 Emergency Department Visit Motor Vehicle Crashes 18-24 82003 ## # ... with 200 more rows
That was a short tutorial about pivot_wider()
and pivot_longer()
from the tidyr
package with an example from this week’s #TidyTuesday challenge.
If you have any questions or suggestions, please let me know in the comments below.
More #TidyTuesday
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
Comments (2)
Im here just to let you know how your tutorial saved my day hahaha Thanks A LOT <3
I am happy you found it useful 🙂