Rowwise Data Wrangling Example With purrr, pmap, and Others in R
May 20, 2019 By Pascal Schmidt programming R Tidyverse Tutorial
Last time, we talked about row-wise operations with purrr
and pmap()
after a colleague of mine got me thinking about row-wise operations in R.
In this post, I will be going over a small example data set which outlines the problem we wanted to solve.
First, the data set:
library(tidyverse) dat <- data.frame( "SCORE_2010" = c(0, 0, 1, 0), "SCORE_2011" = c(1, 0, 1, 0), "SCORE_2012" = c(0, 0, 1, 0), "SCORE_2013" = c(1, 0, 0, 1), "TAX_2010" = c(0, 1, 1, 0), "TAX_2011" = c(1, 0, 1, 0), "TAX_2012" = c(0, 0, 1, 0), "TAX_2013" = c(1, 0, 0, 1), "FISCAL_2010" = c(1, 1, 1, 1), "FISCAL_2011" = c(1, 0, 0, 0), "FISCAL_2012" = c(1, 0, 1, 0), "FISCAL_2013" = c(1, 1, 0, 1), "FAM_2010" = c(0, 1, 1, 0), "FAM_2011" = c(1, 0, 1, 1), "FAM_2012" = c(0, 0, 1, 0), "FAM_2013" = c(1, 1, 0, 1), "TEA_2010" = c(0, 1, 1, 1), "TEA_2011" = c(1, 0, 1, 0), "TEA_2012" = c(0, 1, 1, 0), "TEA_2013" = c(1, 0, 0, 0), "COFFEE_2010" = c(0, 1, 1, 0), "COFFEE_2011" = c(1, 1, 0, 0), "COFFEE_2012" = c(1, 0, 1, 0), "COFFEE_2013" = c(1, 0, 0, 0) ) dat ## SCORE_2010 SCORE_2011 SCORE_2012 SCORE_2013 TAX_2010 TAX_2011 TAX_2012 ... ## 1 0 1 0 1 0 1 0 ... ## 2 0 0 0 0 1 0 0 ... ## 3 1 1 1 0 1 1 1 ... ## 4 0 0 0 1 0 0 0 ...
In the data set above, we have different columns SCORE
, TAX
, FISCAL
, FAM
, TEA
, and COFFEE
. In the columns, we have either a zero or a one. The task is to identify the latest year where one occurs for each row and each variable.
So for example in the first row for the SCORE
variable we would see that the most recent one occurred in 2013. For TAX
, 2013 as well. For all the other variables it is 2013 as well. When we have a look at the second row, the most recent one occurred in none of the years for SCORE
. In the third row, the most recent one occurred in 2012 for TAX
. You get the idea.
In order to solve this problem, we will primarily be using purrr
’s pmap()
function to do row-wise operations. Let’s jump into it.
Rowwise Operations With purrr and pmap()
dat %>% purrr::pmap(~c(...)) ## [[1]] ## SCORE_2010 SCORE_2011 SCORE_2012 SCORE_2013 TAX_2010 TAX_2011 ## 0 1 0 1 0 1 ## TAX_2012 TAX_2013 FISCAL_2010 FISCAL_2011 FISCAL_2012 FISCAL_2013 ## 0 1 1 1 1 1 ## FAM_2010 FAM_2011 FAM_2012 FAM_2013 TEA_2010 TEA_2011 ## 0 1 0 1 0 1 ## TEA_2012 TEA_2013 COFFEE_2010 COFFEE_2011 COFFEE_2012 COFFEE_2013 ## 0 1 0 1 1 1 ## ## [[2]] ## SCORE_2010 SCORE_2011 SCORE_2012 SCORE_2013 TAX_2010 TAX_2011 ## 0 0 0 0 1 0 ## TAX_2012 TAX_2013 FISCAL_2010 FISCAL_2011 FISCAL_2012 FISCAL_2013 ## 0 0 1 0 0 1 ## FAM_2010 FAM_2011 FAM_2012 FAM_2013 TEA_2010 TEA_2011 ## 1 0 0 1 1 0 ## TEA_2012 TEA_2013 COFFEE_2010 COFFEE_2011 COFFEE_2012 COFFEE_2013 ## 1 0 1 1 0 0 ## ## [[3]] ## SCORE_2010 SCORE_2011 SCORE_2012 SCORE_2013 TAX_2010 TAX_2011 ## 1 1 1 0 1 1 ## TAX_2012 TAX_2013 FISCAL_2010 FISCAL_2011 FISCAL_2012 FISCAL_2013 ## 1 0 1 0 1 0 ## FAM_2010 FAM_2011 FAM_2012 FAM_2013 TEA_2010 TEA_2011 ## 1 1 1 0 1 1 ## TEA_2012 TEA_2013 COFFEE_2010 COFFEE_2011 COFFEE_2012 COFFEE_2013 ## 1 0 1 0 1 0 ## ## [[4]] ## SCORE_2010 SCORE_2011 SCORE_2012 SCORE_2013 TAX_2010 TAX_2011 ## 0 0 0 1 0 0 ## TAX_2012 TAX_2013 FISCAL_2010 FISCAL_2011 FISCAL_2012 FISCAL_2013 ## 0 1 1 0 0 1 ## FAM_2010 FAM_2011 FAM_2012 FAM_2013 TEA_2010 TEA_2011 ## 0 1 0 1 1 0 ## TEA_2012 TEA_2013 COFFEE_2010 COFFEE_2011 COFFEE_2012 COFFEE_2013 ## 0 0 0 0 0 0
First, we use pmap()
to get a list of 4 row named vectors.
dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x) %>% head(., 4)) [[1]] type value 1 SCORE_2010 0 2 SCORE_2011 1 3 SCORE_2012 0 4 SCORE_2013 1 [[2]] type value 1 SCORE_2010 0 2 SCORE_2011 0 3 SCORE_2012 0 4 SCORE_2013 0 [[3]] type value 1 SCORE_2010 1 2 SCORE_2011 1 3 SCORE_2012 1 4 SCORE_2013 0 [[4]] type value 1 SCORE_2010 0 2 SCORE_2011 0 3 SCORE_2012 0 4 SCORE_2013 1
Second, we map over the list entries and transform the named vectors into a data frame. There are two columns with the value (0 or 1) and the column name (SCORE_2010, SCORE_2011, SCORE_2012, SCORE_2013 etc.).
dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x)) %>% purrr::map(~tidyr::separate(., col = "type", into = c("type", "year"), sep = "_")) [[1]] type year value 1 SCORE 2010 0 2 SCORE 2011 1 3 SCORE 2012 0 4 SCORE 2013 1 5 TAX 2010 0 6 TAX 2011 1 7 TAX 2012 0 8 TAX 2013 1 . . . . . . . . . . . .
Now, we separate the column names into type and year.
dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x)) %>% purrr::map(~tidyr::separate(., col = "type", into = c("type", "year"), sep = "_")) %>% purrr::map(~dplyr::mutate(., year = as.integer(year))) %>% purrr::map(~dplyr::arrange(., desc(value), desc(year)))
Afterward, we arrange the data frame by value and year in descending order.
dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x)) %>% purrr::map(~tidyr::separate(., col = "type", into = c("type", "year"), sep = "_")) %>% purrr::map(~dplyr::mutate(., year = as.integer(year))) %>% purrr::map(~dplyr::arrange(., desc(value), desc(year))) %>% purrr::map(~dplyr::distinct(., type, .keep_all = TRUE)) [[1]] type year value 1 SCORE 2013 1 2 TAX 2013 1 3 FISCAL 2013 1 4 FAM 2013 1 5 TEA 2013 1 6 COFFEE 2013 1 [[2]] type year value 1 FISCAL 2013 1 2 FAM 2013 1 3 TEA 2012 1 4 COFFEE 2011 1 5 TAX 2010 1 6 SCORE 2013 0 [[3]] type year value 1 SCORE 2012 1 2 TAX 2012 1 3 FISCAL 2012 1 4 FAM 2012 1 5 TEA 2012 1 6 COFFEE 2012 1 [[4]] type year value 1 SCORE 2013 1 2 TAX 2013 1 3 FISCAL 2013 1 4 FAM 2013 1 5 TEA 2010 1 6 COFFEE 2013 0
By default, dplyr::distinct()
picks the first unique value of the type column. This is exactly what we wanted because we ordered by year and value in descending order. So, by default, we will be picking the most recent year that has a one in the value column.
dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x)) %>% purrr::map(~tidyr::separate(., col = "type", into = c("type", "year"), sep = "_")) %>% purrr::map(~dplyr::mutate(., year = as.integer(year))) %>% purrr::map(~dplyr::arrange(., desc(value), desc(year))) %>% purrr::map(~dplyr::distinct(., type, .keep_all = TRUE)) %>% purrr::map(~tidyr::spread(., type, year)) %>% purrr::map(~dplyr::filter(., value != 0)) %>% do.call(rbind, .) %>% cbind(dat, .)
Now, we spread the data frame so the entries in the type
column become variables with the particular year where one occurred. Afterward we filter out the 0 in the Value column.
And we are done. We rbind
all the data frames in the list and cbind
the output to our original data frame. Now, for every variable (SCORE
, TAX
, FISCAL
, FAM
, TEA
, and COFFEE
) we know the most recent year where one occurred.
This was quite fun. However, the code took around 10-15 minutes to run. A bit inefficient. Hence, I tried to do it in another way. The beautiful thing about coding and data munging is that we can get creative with the ways we solve problems. There are so many ways to do one task that is very exciting to solve a problem in various ways.
Second Approach With pmap() and Other purrr Functions
First, we take the column names of the data frame and remove all the underscores and numbers. Then we create a helper function that is able to recognize the value one.
col_names <- colnames(dat) %>% stringr::str_remove_all("_.*") # helper function is_one <- function(x) x == 1
Again, we are using pmap()
to get a list of named row vectors. Then we rename the named row vectors with the one we have stored in col_names
. Afterwards, we split up the row vectors in the list by their names. So at the end, we have a list of lists of named vectors.
dat %>% purrr::pmap(~c(...) %>% purrr::set_names(col_names)) %>% purrr::map(~base::split(., names(.))) -> u
What we will be doing now is looping over the list of lists with lapply(list, lapply, …)
and identify the index which contains the first one starting from the end with purrr
’s detect_index()
function.
lapply(u, lapply, purrr::detect_index, is_one, .dir = "backward") %>% as.data.frame() -> df
The rest is just using basic dplyr
and tidyr
functions. A neat trick, to get the final years in the end, is to add a base year (in our case 2009) to all the indexes we have found.
df %>% tidyr::gather(COFFEE:TEA.3, key = "key", value = "value") %>% dplyr::mutate(key = stringr::str_remove_all(key, "\\..*")) %>% dplyr::arrange(key) %>% dplyr::mutate(ID = rep(1:nrow(dat), nrow(.) / nrow(dat))) %>% tidyr::spread(key, value) %>% dplyr::select(-ID) %>% dplyr::mutate_at(vars(everything()), ~ . + 2009) %>% dplyr::mutate_at(vars(everything()), ~ ifelse(. == 2009, NA, .)) ## COFFEE FAM FISCAL SCORE TAX TEA ## 1 2013 2013 2013 2013 2013 2013 ## 2 2011 2013 2013 NA 2010 2012 ## 3 2012 2012 2012 2012 2012 2012 ## 4 NA 2013 2013 2013 2013 2010
Comparison of Algorithms
system.time( dat %>% purrr::pmap(~c(...) %>% purrr::set_names(col_names)) %>% purrr::map(~base::split(., names(.))) %>% lapply(., lapply, detect_index, is_one, .dir = "backward") %>% as.data.frame() %>% tidyr::gather(COFFEE:TEA.3, key = "key", value = "value") %>% dplyr::mutate(key = stringr::str_remove_all(key, "\\..*")) %>% dplyr::arrange(key) %>% dplyr::mutate(ID = rep(1:nrow(dat), nrow(.) / nrow(dat))) %>% tidyr::spread(key, value) %>% dplyr::select(-ID) %>% dplyr::mutate_at(vars(everything()), ~ . + 2009) %>% dplyr::mutate_at(vars(everything()), ~ ifelse(. == 2009, NA, .)) -> second_approach ) ## user system elapsed ## 0.05 0.00 0.05 system.time( dat %>% purrr::pmap(~c(...)) %>% purrr::map(~data.frame(.x) %>% tibble::rownames_to_column(var = "type") %>% dplyr::rename(value = .x)) %>% purrr::map(~tidyr::separate(., col = "type", into = c("type", "year"), sep = "_")) %>% purrr::map(~dplyr::mutate(., year = as.integer(year))) %>% purrr::map(~dplyr::arrange(., desc(value), desc(year))) %>% purrr::map(~dplyr::distinct(., type, .keep_all = TRUE)) %>% purrr::map(~tidyr::spread(., type, year)) %>% purrr::map(~dplyr::filter(., value != 0)) %>% do.call(rbind, .) %>% as.data.frame() %>% dplyr::select(-value) -> first_approach ) ## user system elapsed ## 0.03 0.00 0.03
After comparing algorithms we can see that the second approach is faster. When we have a big data frame with many columns, this makes a big difference.
I am constantly learning and I am sure there are easier approaches out there that are way more efficient and more readable. If you know other and better ways of solving the task above, please let me know in the comments below. Thank you!
Additional Data Manipulation Resources
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications