Tidyr’s pivot_longer() and pivot_wider() Examples From the #TidyTuesday Challenge

March 28, 2020 By Pascal Schmidt #TidyTuesday R

In today’s blog post, we will be talking about tidyrs pivot_wider() and pivot_longer() functions in the context of the brain injuries #TidyTuesday challenge.

The two functions above have been developed to make it easier for R users/programmers to have a better mental model about the transformations on the data frames. Unlike gather and spread, pivot_longer() and pivot_wider() are the opposites of each other.

Let’s see a quick example and then dive into the #TidyTuesday’s brain injuries example.

Tidyr’s vignette about pivot_longer() and pivot_wider() can be found here.

What we will be covering:

An example of how pivot_longer() works.
An example of how pivot_wider() works from this week’s #TidyTuesday data set.

First, let’s start with a toy data set that illustrates these concepts.

pivot_wider Example

library(tidyverse)

year <- paste0(2001:2020)
team <- letters[1:length(year)]
num <- sample(c(1:100), size = length(year)^2, replace = TRUE)
df <- data.frame(matrix(num, nrow = length(year), ncol = length(year))) %>%
  purrr::set_names(year) %>%
  dplyr::bind_cols(data.frame(category = cat)) %>%
  dplyr::as_tibble()

## # A tibble: 20 x 21
##    2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
##     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1     31     26      7     41     52     79     24     83     88     77      7
##  2     79      7     53     47     22     85     87     89     54     46     58
##  3     51     42     27     90     89     37      7     39     75     70     61
##  4     14      9     96     60     32      8     93     54     48     72     74
##  5     67     83     38     95     25     51     79     90     20     44     24
##  6     42     36     89     16     87     74     23      9     67     32     63
##  7     50     78     34     94     35     50     26     71     93     36     54
##  8     43     81     93      6     40     98     32     98     36     45     23
##  9     14     43     69     72     30     74      7     48     52     14     26
## 10     25     76     72     86     12     86     27     77     22     16     33
## 11     90     15     76     86     31     76     42     83     49     87     57
## 12     91     32     63     39     30     84      5     56     42     33     29
## 13     69      7     13     31     64     46     70     39     59     40     10
## 14     91      9     82     81     99     17     16     68     84     40     53
## 15     57     41     97     50     14     62     24      1     11     10     54
## 16     92     74     91     34     93     46     32     40     55     89     77
## 17      9     23     25      4     96     54     21     30      8     72     11
## 18     93     27     38     13     71     35     55     94     46     82     25
## 19     99     60     21     69     67     94     75     89     85      9     52
## 20     72     53     79     25     23     79     36     16     66      7     26
## # ... with 10 more variables: 2012 <int>, 2013 <int>, 2014 <int>,
## #   2015 <int>, 2016 <int>, 2017 <int>, 2018 <int>, 2019 <int>,
## #   2020 <int>, category <fct>

Above, we have a data frame with teams from a to t and their corresponding points scored in the league in the years from 2001 to 2020. In order to make this data tidy, we are going to use the pivot_longer() function from the tidyr package, to “lengthen” the data frame.

Below, the first argument is the data frame and the second one is the columns we want to put into a longer format. We do not have to write these column names in quotation marks because they already exist in the data frame. However, we have to put backticks around them because they are numbers and hence, no valid columns in a data frame.

Next, we want to say that all the columns specified, (2001 – 2020) will be put in a column called year. This time we have to put it in quotation marks because the column does not exist yet.

Next, we put all the data stored in the columns 2001 – 2020 into another column called points. The resulting data frame is the one shown below.

pivot_longer(df, 2001:2020, names_to = "year", values_to = "points")

## # A tibble: 400 x 3
##    category year  points
##    <fct>    <chr>  <int>
##  1 a        2001      31
##  2 a        2002      26
##  3 a        2003       7
##  4 a        2004      41
##  5 a        2005      52
##  6 a        2006      79
##  7 a        2007      24
##  8 a        2008      83
##  9 a        2009      88
## 10 a        2010      77
## # ... with 390 more rows

pivot_longer()’s Advantage Over gather()

With the pivot_longer() function, we can also specify multiple names for the names_to argument. For example:

df <- structure(list(athlete = structure(c(5L, 4L, 1L, 3L, 6L, 2L), .Label = c("Billy Bob", 
"Harry Hat", "Jane Doe", "Joe Bloggs", "John Smith", "Sam Smith"
), class = "factor"), T1.HRmin = c(108L, 139L, 134L, 124L, 124L, 
116L), T1.HRmax = c(136L, 159L, 160L, 146L, 154L, 148L), T2.HRmin = c(137L, 
160L, 161L, 147L, 155L, 149L), T2.HRmax = c(149L, 175L, 178L, 
163L, 164L, 165L), T3.HRmin = c(150L, 176L, 179L, 164L, 165L, 
166L), T3.HRmax = c(161L, 189L, 194L, 172L, 180L, 177L), T4.HRmin = c(162L, 
190L, 195L, 173L, 181L, 178L), T4.HRmax = c(166L, 194L, 200L, 
177L, 186L, 182L), T5.HRmin = c(167L, 195L, 201L, 178L, 187L, 
183L), T5.HRmax = c(220L, 220L, 220L, 200L, 201L, 220L)), class = "data.frame", row.names = c(NA, 
-6L))

##      athlete T1.HRmin T1.HRmax T2.HRmin T2.HRmax T3.HRmin T3.HRmax T4.HRmin
## 1 John Smith      108      136      137      149      150      161      162
## 2 Joe Bloggs      139      159      160      175      176      189      190
## 3  Billy Bob      134      160      161      178      179      194      195
## 4   Jane Doe      124      146      147      163      164      172      173
## 5  Sam Smith      124      154      155      164      165      180      181
## 6  Harry Hat      116      148      149      165      166      177      178
##   T4.HRmax T5.HRmin T5.HRmax
## 1      166      167      220
## 2      194      195      220
## 3      200      201      220
## 4      177      178      200
## 5      186      187      201
## 6      182      183      220

We can see that the columns consist of multiple heart rate zones and HRmin and HRmax. In the names_to argument we can create multiple columns by creating a character vector and specifying the names_sep argument.

df %>% 
  tidyr::pivot_longer(cols = starts_with('T'), 
                      names_to = c('col', '.value'), 
                      names_sep = '\\.')

## # A tibble: 30 x 4
##    athlete    col   HRmin HRmax
##    <fct>      <chr> <int> <int>
##  1 John Smith T1      108   136
##  2 John Smith T2      137   149
##  3 John Smith T3      150   161
##  4 John Smith T4      162   166
##  5 John Smith T5      167   220
##  6 Joe Bloggs T1      139   159
##  7 Joe Bloggs T2      160   175
##  8 Joe Bloggs T3      176   189
##  9 Joe Bloggs T4      190   194
## 10 Joe Bloggs T5      195   220
## # ... with 20 more rows

Because we provided the special character .value sentinel in the names_to argument, we do not have to specify a values_to argument and thevalue column will be derived from the existing column (HRmin, HRmax). In order to do that, we have to specify the names_sep argument in the function and specify how we want to split up the column names.

The example above can be found on StackOverflow.

pivot_wider and tidytuesday Challenge

Now onto the tidytuesday challenge. The data sets can be found on its GitHub website or you can read in the data with the URL below.

tbi_age <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-24/tbi_age.csv')

My goal was to visualize the cause of brain injury by age group. However, the way the age_group column was structured annoyed me.

tbi_age %>%
  dplyr::group_by(age_group) %>%
  dplyr::summarise(sum_num = sum(number_est, na.rm = TRUE))

## # A tibble: 11 x 2
##    age_group sum_num
##    <chr>       <dbl>
##  1 0-17       836718
##  2 0-4        331403
##  3 15-24      476652
##  4 25-34      319303
##  5 35-44      240462
##  6 45-54      265417
##  7 5-14       348913
##  8 55-64      241511
##  9 65-74      211035
## 10 75+        442377
## 11 Total     2877539

We can see that there is some overlap between age groups. For example, a lot of numbers in the age group from 0-17 are included in the age group 0-4. Moreover, I did not like the grouping of age groups. Hence, I decided to use pivot_wider() to group the column into these age groups: 0-4, 0-17, 18-24…

The code below shows how I did that:

tbi_age %>%
  dplyr::select(type, injury_mechanism, age_group, number_est) %>%
  tidyr::pivot_wider(names_from = age_group, values_from = number_est)

## # A tibble: 21 x 13
##    type  injury_mechanism  0-17   0-4   5-14   15-24   25-34   35-44   45-54
##    <chr> <chr>             <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Emer~ Motor Vehicle C~  47138   5464  19785  103892   71641   44108   40020
##  2 Emer~ Unintentional F~ 397190 230776 133084   96568   70210   68830   95127
##  3 Emer~ Unintentionally~ 229236  53436 120839  106679   44404   32479   30495
##  4 Emer~ Other unintenti~  55785  12007  30656   37118   22360   17541   17808
##  5 Emer~ Intentional sel~     NA     NA     NA     870     650     421     247
##  6 Emer~ Assault           24360    674   9690   65399   57213   34100   27682
##  7 Emer~ Other or no mec~  57983  19360  26022   33395   20974   16503   15962
##  8 Hosp~ Motor Vehicle C~   5830    870   2395   12925   11050    7305    8490
##  9 Hosp~ Unintentional F~   7935   4700   2270    3910    4470    5640   12010
## 10 Hosp~ Unintentionally~   1985    510    980    1070     635     610     685
## # ... with 11 more rows, and 4 more variables: 55-64 <dbl>, 65-74 <dbl>,
## #   75+ <dbl>, Total <dbl>

We take the names from the age_group column and then make these categories to column names. Then we take the values from the number_est column and put these values in the appropriate age group columns.

Now it is easy for me to use mutate() and construct my new categories.

tbi_age %>%
  dplyr::select(type, injury_mechanism, age_group, number_est) %>%
  tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>%
  dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14,
                5-17 = 0-17 - 0-4,
                18-24 = 15-24 - 15-17) %>%
  dplyr::select(-c(0-17,5-14, 15-24, 15-17))

## # A tibble: 21 x 12
##    type  injury_mechanism   0-4    25-34   35-44   45-54   55-64   65-74  75+
##    <chr> <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
##  1 Emer~ Motor Vehicle C~   5464   71641   44108   40020   27193   13829   8176
##  2 Emer~ Unintentional F~ 230776   70210   68830   95127  112460  120327 286031
##  3 Emer~ Unintentionally~  53436   44404   32479   30495   20408   11937  13270
##  4 Emer~ Other unintenti~  12007   22360   17541   17808   12928    7077   7440
##  5 Emer~ Intentional sel~     NA     650     421     247     105      NA     NA
##  6 Emer~ Assault             674   57213   34100   27682   11538    2893   1260
##  7 Emer~ Other or no mec~  19360   20974   16503   15962   13387   10051  17318
##  8 Hosp~ Motor Vehicle C~    870   11050    7305    8490    7280    4485   3965
##  9 Hosp~ Unintentional F~   4700    4470    5640   12010   18490   25235  74005
## 10 Hosp~ Unintentionally~    510     635     610     685     765     790   1045
## # ... with 11 more rows, and 3 more variables: Total <dbl>, 5-17 <dbl>,
## #   18-24 <dbl>

This is how the new table looks like. Now, as a sanity check, let’s sum up all the age groups and see if it going to match the total number.

tbi_age %>%
  dplyr::select(type, injury_mechanism, age_group, number_est) %>%
  tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>%
  dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14,
                5-17 = 0-17 - 0-4,
                18-24 = 15-24 - 15-17) %>%
  dplyr::select(-c(0-17,5-14, 15-24, 15-17)) %>%
  dplyr::mutate(Total = as.character(Total)) %>%
  dplyr::mutate(Total_check = purrr::pmap_dbl(dplyr::select_if(., is.numeric), sum, na.rm = TRUE)) %>%
  dplyr::select(Total, Total_check)


## # A tibble: 21 x 2
##    Total   Total_check
##    <chr>         <dbl>
##  1 334109       334108
##  2 1213412     1213413
##  3 433947       433947
##  4 164935       164935
##  5 2567           1423
##  6 210450       210449
##  7 172974       172972
##  8 58765         58765
##  9 150730       150730
## 10 7090           7090
## # ... with 11 more rows

In row 5 where the numbers do not match up at all, this is because we have NA values, this is because we have NA values for one of the injury categories. Other than that, everything is matching up just fine!

Now, I already mentioned that pivot_wider() and pivot_longer() are opposites. Hence, we can easily use pivot_longer() from tidyr to get the data back into its original shape again.

From pivot_wider() to pivot_longer()

tbi_age %>%
  dplyr::select(type, injury_mechanism, age_group, number_est) %>%
  tidyr::pivot_wider(names_from = age_group, values_from = number_est) %>%
  dplyr::mutate(15-17 = 0-17 - 0-4 - 5-14,
                5-17 = 0-17 - 0-4,
                18-24 = 15-24 - 15-17) %>%
  dplyr::select(-c(0-17,5-14, 15-24, 15-17)) %>%
  dplyr::mutate(Total = as.character(Total)) %>%
  dplyr::mutate(Total_check = purrr::pmap_dbl(dplyr::select_if(., is.numeric), sum, na.rm = TRUE)) %>%
  dplyr::select(-Total_check) %>%
  
  # convert back to double so all values in one column are of type double
  dplyr::mutate(Total = as.double(Total)) %>%
  tidyr::pivot_longer(0-4:18-24, names_to = "age_group", values_to = "num_est")

## # A tibble: 210 x 4
##    type                       injury_mechanism      age_group num_est
##    <chr>                      <chr>                 <chr>       <dbl>
##  1 Emergency Department Visit Motor Vehicle Crashes 0-4          5464
##  2 Emergency Department Visit Motor Vehicle Crashes 25-34       71641
##  3 Emergency Department Visit Motor Vehicle Crashes 35-44       44108
##  4 Emergency Department Visit Motor Vehicle Crashes 45-54       40020
##  5 Emergency Department Visit Motor Vehicle Crashes 55-64       27193
##  6 Emergency Department Visit Motor Vehicle Crashes 65-74       13829
##  7 Emergency Department Visit Motor Vehicle Crashes 75+          8176
##  8 Emergency Department Visit Motor Vehicle Crashes Total      334109
##  9 Emergency Department Visit Motor Vehicle Crashes 5-17        41674
## 10 Emergency Department Visit Motor Vehicle Crashes 18-24       82003
## # ... with 200 more rows

That was a short tutorial about pivot_wider() and pivot_longer() from the tidyr package with an example from this week’s #TidyTuesday challenge.

If you have any questions or suggestions, please let me know in the comments below.

More #TidyTuesday

Tags: data manipulation R tidyr