Rowwise Operations on Data Frames With purrr’s pmap() Function in R
May 2, 2019 By Pascal Schmidt R Tidyverse Tutorial
Usually, we work with columns in R. Finding the mean, max, or median of a variable is very straightforward. However, when we want to work with rows, it is less obvious what to do. In this blog post, I will explain how to do row-wise operations on data frames in R.
When a colleague asked me how to do row-wise operations, I was clueless apart from the standard apply() approach. Afterward, I dug a bit deeper and discovered purrr’s pmap() function, which I want to demonstrate in this blog post together with the usual apply() approach and the split, apply, combine approach.
As always, we are using the Pokemon data set for demonstration purposes. The Pokemon data set can be found on Kaggle.
Outline:
- The base R apply() approach.
- purrr’s pmap() function for rowwise operations.
- Split, apply, combine approach.
- Summary of approaches.
Base R’s apply() Function
First, let’s look at the base R solution with apply().
library(tidyverse)

poke <- read.csv("Pokemon.csv") %>%
  dplyr::select(-c(X., Total, Generation))

head(poke, 3)
##        Name Type.1 Type.2 HP Attack Defense Sp..Atk Sp..Def Speed
## 1 Bulbasaur  Grass Poison 45     49      49      65      65    45
## 2   Ivysaur  Grass Poison 60     62      63      80      80    60
## 3  Venusaur  Grass Poison 80     82      83     100     100    80
Let’s say we want to find out in which of the numeric variables (HP, Attack, Defense, Sp..Atk, Sp..Def, and Speed) each Pokemon is strongest. If we take the first Pokemon, Bulbasaur, we see that its strongest attributes are its special attack and special defense.
First Try
In order to do that for all Pokemon, we are using apply() first:
poke %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::mutate(max_val = base::apply(., 1, max),
                index = base::apply(., 1, which.max),
                variable = colnames(.)[index]) %>%
  dplyr::group_by(variable) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(variable = factor(variable, levels = variable)) -> max_attribute

max_attribute
## # A tibble: 6 x 2
##   variable     n
##   <fct>    <int>
## 1 Attack     222
## 2 Defense    141
## 3 Sp..Atk    137
## 4 Speed      133
## 5 HP         101
## 6 Sp..Def     66

ggplot(max_attribute, aes(x = variable, y = n, fill = variable)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
The approach above is suboptimal. When we look at Bulbasaur, there are two maximum values. Because which.max() returns only the position of the first maximum in a vector, ties are always credited to the column that comes first. Hence, our plot is biased toward the earlier columns.
Code explanation:
- We only select numeric columns from the data frame.
- Then we record the maximum value and the index of the maximum value, and find the column name with the help of the index.
- The number one in the apply() calls means that we want to loop through rows.
- Afterwards, we are just using standard dplyr verbs to produce the plot.
- We arrange the rows in descending order by n and then set the levels of the variable column to plot the bars in decreasing order.
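To make the tie problem concrete, here is a minimal sketch on a made-up three-column data frame (the values in toy are hypothetical and not taken from the Pokemon data set). It shows what the margin argument of apply() does and how which.max() behaves when a row has two maxima.

```r
# Hypothetical toy data, not from the Pokemon data set
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))

# Margin 1 loops over rows, margin 2 loops over columns
row_max <- apply(toy, 1, max)
col_max <- apply(toy, 2, max)

# which.max() reports only the first position of the maximum,
# so the tie between Attack and Speed in row 1 is credited to Attack
first_max <- apply(toy, 1, which.max)
strongest <- colnames(toy)[first_max]
strongest
```

In row 1, Speed is just as large as Attack, but it never shows up in strongest.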
Let’s try something else that works better. In order to extract all the maximum values in a certain row, we will be creating a helper function.
Second and Better Try
# helper function which finds the positions of all maximum values in a row
all_max <- function(x) which(x == max(x))

poke %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::mutate(all_max = base::apply(., 1, all_max)) %>%
  .[, "all_max"] %>%
  purrr::flatten_int() %>%
  base::names() %>%
  dplyr::as_tibble() %>%
  dplyr::group_by(value) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(value = factor(value, levels = value)) %>%
  ggplot(., aes(x = value, y = n, fill = value)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
- In the code above, we are again choosing only numeric variables from the data frame.
- We use apply() in combination with our helper function.
- We extract the all_max column, which is a list. Each element holds the positions of the row maxima, named by the columns where they occur.
- Afterwards, we use purrr’s flatten_int() function and get back a named vector, from which we extract the names. Then we again use standard dplyr verbs to put the data into ggplot().
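The difference between which.max() and the helper can be sketched on a single made-up row (the values below are hypothetical, and the small loop is only one way to collect the names with base R):

```r
# Same helper as in the post
all_max <- function(x) which(x == max(x))

# Hypothetical row with a tie between Attack and Speed
row1 <- c(HP = 45L, Attack = 65L, Speed = 65L)

first_only <- names(which.max(row1))  # only the first maximum
all_hits   <- names(all_max(row1))    # every column that holds the maximum

# Counting hits over several rows of a hypothetical toy data frame
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))
hits <- unlist(lapply(seq_len(nrow(toy)),
                      function(i) names(all_max(unlist(toy[i, ])))))
counts <- table(hits)
counts
```

Because the comparison x == max(x) keeps the names of the vector, which() returns named positions, and that is what makes the names() step in the pipeline work.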
Voila, an unbiased count of how often each column holds a row maximum.
It looks like attacking power is most often a Pokemon’s strongest attribute, and HP is least often the strongest. I removed everything on the y axis because we are only interested in relative numbers, not absolute ones.
The approach above felt a bit awkward. In my opinion, we can simplify our code and make it more readable with pmap().
Rowwise operation with purrr’s pmap Function
poke %>%
  dplyr::select_if(is.numeric) %>%
  purrr::pmap(~ c(...)) %>%
  purrr::map(~ ifelse(.x != max(.x), NA, .x)) %>%
  base::do.call(rbind, .) %>%
  dplyr::as_tibble() %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 12))
Below, there is a variation without the map() call after pmap(). It is a little bit harder to understand because we do not get as clear a mental model of what c(...) does.
poke %>%
  dplyr::select_if(is.numeric) %>%
  purrr::pmap(~ ifelse(c(...) != max(c(...)), NA, c(...))) %>%
  base::do.call(rbind, .) %>%
  dplyr::as_tibble() %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none")
Step by step code explanation:
- First, we get all numeric columns.
- Then we use pmap() in combination with c(...), which binds the column values of each row into a “row” vector. We now have a list of 800 “row” vectors, where each element of the list represents one row.
- Then we apply the ifelse() function to every element of that list. If an element of the vector is equal to the maximum value of that vector, we keep it. Otherwise, we put an NA there.
- Afterwards, we rbind all vectors together and convert the output to a tibble (do.call(rbind, .) returns a matrix here).
- The rest is just the usual tidyr and dplyr verbs to summarize the data.
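The first steps can be traced on a small made-up data frame (toy and its values are hypothetical). This is only a sketch of the intermediate objects, with a base R colSums() standing in for the gather and summarise steps:

```r
library(purrr)

# Hypothetical toy data, not from the Pokemon data set
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))

# pmap() passes the values of one row as named arguments;
# c(...) collects them into a named "row" vector
rows <- pmap(toy, ~ c(...))
rows[[1]]

# Keep only the row maxima, replace everything else with NA
masked <- map(rows, ~ ifelse(.x != max(.x), NA, .x))

# Stack the row vectors into a matrix with one row per original row
m <- do.call(rbind, masked)

# Number of rows in which each column holds a maximum
n_max <- colSums(!is.na(m))
n_max
```

Row 1 has a tie between Attack and Speed, so both columns are counted for it.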
This took a few more steps than the apply() approach. However, the mental picture we get with pmap(~ c(...)) makes it very easy to proceed and manipulate the data further.
Split, Apply, Combine Approach
This next approach is another variation of how to do rowwise operations in R. I got the idea from David Robinson when he replied to another person’s Twitter question. It is a bit slower, but it is yet another way to get there.
poke %>%
  dplyr::select_if(is.numeric) %>%
  base::split(., seq_len(nrow(.))) %>%
  purrr::map(~ ifelse(.x != max(.x), NA, .x)) %>%
  purrr::map(~ purrr::flatten_int(.)) %>%
  base::do.call(rbind, .) %>%
  dplyr::as_tibble() %>%
  purrr::set_names(poke %>% dplyr::select_if(is.numeric) %>% colnames(.)) %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 12))
The function that allows us to do row-wise operations is base::split(., seq_len(nrow(.))), which splits the data frame into a list of one-row data frames. Afterward, we have to do some more manipulation than in the approaches above.
Again, note that we have used purrr’s flatten_int() function, which turns a list into an integer vector in this case.
Moreover, we have also used purrr’s set_names() function, which is useful when you want to name the columns of a data frame. We could have also used:
names(poke) <- c("Name", "Type.1", "Type.2", "HP", "Attack", "Defense", "Sp_Atk", "Sp_Def", "Speed", "Legendary")
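To see what the row-wise split produces, here is a minimal base R sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical toy data, not from the Pokemon data set
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))

# Splitting by the row index gives a list with one single-row data frame per row
pieces <- split(toy, seq_len(nrow(toy)))
names(pieces)
pieces[[1]]

# Each piece is still a data frame, so we unlist() it before taking the maximum
row_max <- vapply(pieces, function(d) max(unlist(d)), integer(1))
row_max
```

The extra unlist() step hints at why this approach needs more clean-up than pmap(): every piece is a data frame rather than a plain vector.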
Another Example of purrr’s pmap() Function
Let’s say we want to compute the total strength of each Pokemon, that is, the sum of all its numeric attributes. Then we would go about it like this.
poke %>%
  dplyr::mutate(Total = purrr::pmap_int(poke %>% select_if(is.numeric), sum)) %>%
  head(3)
##        Name Type.1 Type.2 HP Attack Defense Sp_Atk Sp_Def Speed Legendary Total
## 1 Bulbasaur  Grass Poison 45     49      49     65     65    45     False   318
## 2   Ivysaur  Grass Poison 60     62      63     80     80    60     False   405
## 3  Venusaur  Grass Poison 80     82      83    100    100    80     False   525
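The same pattern can be checked on a small made-up data frame (toy is hypothetical). For a plain row sum, base R’s rowSums() gives the same numbers, which makes it a handy cross-check:

```r
library(purrr)

# Hypothetical toy data, not from the Pokemon data set
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))

# sum() receives the values of one row as its arguments;
# pmap_int() collects the results into an integer vector
total <- pmap_int(toy, sum)
total

# Cross-check against base R
all(total == rowSums(toy))
```

pmap_int() works here because all columns are integers. With double columns, pmap_dbl() would be the matching variant.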
Rowwise Operation With pmap() on a Subset of Variables
So far, we have been selecting all the variables in the data frame. However, what if you only want to select a subset of columns?
poke %>%
  dplyr::mutate(Att_Def = purrr::pmap_int(list(Attack, Defense), function(x, y) x + y)) %>%
  head(., 3)
##        Name ... Att_Def
## 1 Bulbasaur ...      98
## 2   Ivysaur ...     125
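Two other ways of writing a pmap() call on a subset of columns are sketched below on made-up data (toy is hypothetical). The first relies on pmap() matching the names of the list elements to the argument names of the function. The second uses the positional placeholders ..1 and ..2 inside a formula.

```r
library(purrr)

# Hypothetical toy data, not from the Pokemon data set
toy <- data.frame(HP = c(45L, 60L), Attack = c(65L, 50L), Speed = c(65L, 90L))

# Named list elements are matched to the function's argument names
by_name <- pmap_int(toy[c("Attack", "Speed")],
                    function(Attack, Speed) Attack + Speed)

# Unnamed list elements are passed by position; ..1 and ..2 refer to them
by_position <- pmap_int(list(toy$Attack, toy$Speed), ~ ..1 + ..2)

by_name
by_position
```

Matching by name tends to be easier to read once more than two columns are involved.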
Summary of Rowwise Operation
apply()
Pro:
- Easy and efficient to use.
Cons:
- Requires somewhat unconventional pipe operations.
- The mental model requires a bit of experience (1 means row operation, 2 means column operation).
purrr and pmap()
Pro:
- Easy to integrate with the pipe.
- Clear mental model with purrr::pmap(~ c(...)).
- Functions can be created on the fly with the lambda expression ~.
Cons:
- A bit slower than apply in some cases.
Split, Apply, Combine
Pro:
- I cannot think of anything besides doing what it is supposed to do.
Cons:
- Slower.
- Requires some more steps for data manipulation afterwards.