Rowwise Operations on Data Frames With purrr’s pmap() Function in R

Usually, we are working with columns in R. Finding the mean, max, or median of certain variables is very straightforward. However, when we want to work with rows, it is less obvious what to do. In this blog post, I will explain how to do row-wise operations on data frames in R.

When a colleague asked me how to do row-wise operations, I had no answer beyond the standard apply() approach. Afterward, I dug a bit deeper and discovered purrr’s pmap() function, which I want to demonstrate in this blog post together with the usual apply() approach and the split, apply, combine approach.

As always, we are using the Pokemon data set for demonstration purposes. The Pokemon data set can be found on Kaggle.

Outline:

  • The base R apply() approach.
  • purrr’s pmap() function for rowwise operations.
  • Split, apply, combine approach.
  • Summary of approaches.


Base R’s apply() Function

First, let’s look at the base R solution with apply().

library(tidyverse)
poke <- read.csv("Pokemon.csv") %>%
  dplyr::select(-c(X., Total, Generation))

head(poke, 3)

##        Name Type.1 Type.2 HP Attack Defense Sp..Atk Sp..Def Speed
## 1 Bulbasaur  Grass Poison 45     49      49      65      65    45
## 2   Ivysaur  Grass Poison 60     62      63      80      80    60
## 3  Venusaur  Grass Poison 80     82      83     100     100    80

Let’s say we want to find out in which of the numeric variables (HP, Attack, Defense, Sp..Atk, Sp..Def, and Speed) each Pokemon is strongest. If we take the first Pokemon, Bulbasaur, we see that its strongest attributes are its special attack and special defense.

First Try

In order to do that for all Pokemon, we are using apply() first:

poke %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::mutate(max_val = base::apply(., 1, max),
                index = base::apply(., 1, which.max),
                variable = colnames(.)[index]) %>%
  dplyr::group_by(variable) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(variable = factor(variable, levels = variable)) -> max_attribute

max_attribute

## # A tibble: 6 x 2
##   variable     n
##   <fct>    <int>
## 1 Attack     222
## 2 Defense    141
## 3 Sp..Atk    137
## 4 Speed      133
## 5 HP         101
## 6 Sp..Def     66

ggplot(max_attribute, aes(x = variable, y = n, fill = variable)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

[Figure: bar chart of the number of Pokemon per strongest attribute, first maximum only]

The approach above is suboptimal. Bulbasaur, for example, has two maximum values, but which.max() only returns the position of the first maximum in the vector. Ties are therefore always counted for the leftmost column, which biases our plot.

Code explanation:

  • We select only the numeric columns from the data frame.
  • Then we record the maximum value and the index of the maximum value, and use the index to look up the column name.
  • The number 1 in the apply() calls is the MARGIN argument and means that the function is applied to each row (2 would mean each column).
  • Afterwards, we use standard dplyr verbs to count how often each variable holds the maximum.
  • We arrange the rows in descending order of n and then set the levels of the variable column so that the bars are plotted in decreasing order.
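To make the MARGIN argument and the tie problem concrete, here is a minimal sketch on a made-up two-row data frame. The stats object and its values are my own illustration and are not taken from the Pokemon data set.

```r
# Toy data frame with hypothetical values (not the Pokemon data).
stats <- data.frame(
  HP     = c(45L, 60L),
  Attack = c(49L, 80L),
  Sp_Atk = c(65L, 80L),
  Sp_Def = c(65L, 70L)
)

# MARGIN = 1 applies the function to every row, MARGIN = 2 to every column.
row_max <- apply(stats, 1, max)
col_max <- apply(stats, 2, max)

# which.max() returns only the first position of the maximum, so the ties
# in both rows are resolved in favour of the leftmost tied column.
first_max <- colnames(stats)[apply(stats, 1, which.max)]
first_max
# "Sp_Atk" "Attack"
```

Row one ties on Sp_Atk and Sp_Def, row two on Attack and Sp_Atk, yet only one column per row is reported.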

Let’s try something else that works better. In order to extract all the maximum values in a certain row, we will be creating a helper function.

Second and Better Try

# helper function which finds the positions of all maximum values
all_max <- function(x) which(x == max(x))

poke %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::mutate(all_max = base::apply(., 1, all_max)) %>%
  .[, "all_max"] %>%
  purrr::flatten_int() %>%
  base::names() %>%
  dplyr::as_tibble() %>%
  dplyr::group_by(value) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(value = factor(value, levels = value)) %>%
  ggplot(., aes(x = value, y = n, fill = value)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Code explanation:

  • In the code above, we again choose only the numeric variables from the data frame.
  • We use apply() in combination with our helper function.
  • We extract the all_max column, which is a list holding, for each row, the named positions of the columns where the maximum value occurs.
  • Afterwards, we use purrr’s flatten_int() function to get back a named vector, from which we extract the names. Then we again use standard dplyr verbs to put the data into ggplot().
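The difference between which.max() and the helper can be seen on single rows. The sketch below types in the Bulbasaur and Ivysaur values from the head() output above as named vectors. It uses base R’s unlist() in place of flatten_int(), which is my substitution to keep the example dependency-free.

```r
# Helper from the post: positions of all maximum values.
all_max <- function(x) which(x == max(x))

# Values copied by hand from the head() output shown earlier.
bulbasaur <- c(HP = 45L, Attack = 49L, Defense = 49L,
               Sp..Atk = 65L, Sp..Def = 65L, Speed = 45L)
ivysaur   <- c(HP = 60L, Attack = 62L, Defense = 63L,
               Sp..Atk = 80L, Sp..Def = 80L, Speed = 60L)

which.max(bulbasaur)        # only the first maximum (position 4)
names(all_max(bulbasaur))   # both tied columns
# "Sp..Atk" "Sp..Def"

# Flattening the per-row results keeps the column names,
# so counting the names counts the maxima per column.
tied <- unlist(lapply(list(bulbasaur, ivysaur), all_max))
table(names(tied))
```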

Voilà, an unbiased count of how often each column holds a row’s maximum value.

[Figure: bar chart of the number of Pokemon per strongest attribute, ties included]

It looks like attacking power is most often a Pokemon’s strongest attribute, while HP is the attribute that is least often the strongest. I removed everything on the y axis completely because we are only interested in relative numbers and not absolute ones.

The approach above felt a bit awkward. In my opinion, we can simplify our code and make it more readable with pmap().

Rowwise operation with purrr’s pmap Function

poke %>%
  dplyr::select_if(is.numeric) %>%
  purrr::pmap(~ c(...)) %>%
  purrr::map(~ ifelse(.x != max(.x), NA, .x)) %>%
  base::do.call(rbind, .)  %>%
  dplyr::as_tibble() %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
  geom_bar(stat = "identity") + 
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 12))

[Figure: bar chart of the number of Pokemon per strongest attribute, pmap() approach]

Below is a variation without the map() call after pmap(). It is a little harder to understand because we don’t get as clear a mental model of what c(...) does.

poke %>%
  dplyr::select_if(is.numeric) %>%
  purrr::pmap(~ ifelse(c(...) != max(c(...)), NA, c(...))) %>%
  base::do.call(rbind, .)  %>%
  dplyr::as_tibble() %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
    geom_bar(stat = "identity") +
    theme(legend.position = "none")

Step by step code explanation:

  • First, we get all numeric columns.
  • Then we use pmap() in combination with c(...), which binds the values of each row into a named “row” vector. We now have a list of 800 “row” vectors. Each element in the list represents one row.
  • Then we apply the ifelse() function to every element of that list. If an element of the vector is equal to the maximum value of that vector, then we keep it. Otherwise, we put an NA there.
  • Afterwards, we rbind all vectors together and convert the output to a tibble (do.call(rbind, .) returns a matrix here).
  • The rest is just the usual tidyr and dplyr verbs to summarize the data.

This approach takes a few more steps than the apply() approach. However, the mental picture we get with pmap(~ c(...)) makes it very easy to proceed and manipulate the data further.
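To see what pmap(~ c(...)) actually hands us, here is a small sketch on a made-up data frame. The stats object and its values are hypothetical. The column counting at the end uses base R’s colSums() in place of the gather() pipeline, purely to keep the example short.

```r
library(purrr)

# Toy data frame with hypothetical values (not the Pokemon data).
stats <- data.frame(HP     = c(45L, 60L),
                    Attack = c(49L, 80L),
                    Speed  = c(65L, 80L))

# pmap() passes the columns as named arguments, so c(...) builds
# one named vector per row.
rows <- pmap(stats, ~ c(...))
rows[[1]]

# Keep only the row maxima, set everything else to NA.
masked <- map(rows, ~ ifelse(.x != max(.x), NA, .x))

# Stack the vectors into a matrix and count the non-NA entries per column.
result <- do.call(rbind, masked)
colSums(!is.na(result))
```

In this toy example, the second row ties on Attack and Speed, and both are counted.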

Split, Apply, Combine Approach

This next approach is another variation of how to do rowwise operations in R. I got the idea from David Robinson when he replied to another person’s Twitter question. It is a bit slower, but it is yet another way.

poke %>%
  dplyr::select_if(is.numeric) %>%
  base::split(., seq_len(nrow(.))) %>%
  purrr::map(~ ifelse(.x != max(.x), NA, .x)) %>%
  purrr::map(~ purrr::flatten_int(.)) %>%
  base::do.call(rbind, .) %>%
  dplyr::as_tibble() %>%
  purrr::set_names(poke %>%
                     dplyr::select_if(is.numeric) %>%
                     colnames(.)) %>%
  tidyr::gather(HP:Speed, key = "key", value = "value") %>%
  na.omit() %>%
  dplyr::group_by(key) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(key = factor(key, levels = key)) %>%
  ggplot(., aes(x = key, y = n, fill = key)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 12))

[Figure: bar chart of the number of Pokemon per strongest attribute, split, apply, combine approach]

The call that allows us to do row-wise operations is base::split(., seq_len(nrow(.))), which splits the data frame into a list of one-row data frames. Afterward, we have to do some more manipulations in comparison to the other approaches above.
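A quick look at what split() returns explains the extra steps. The sketch below uses a made-up stats data frame (hypothetical values). It turns each piece into a vector with base R’s unlist(), which is my stand-in for the flatten_int() step.

```r
# Toy data frame with hypothetical values (not the Pokemon data).
stats <- data.frame(HP     = c(45L, 60L),
                    Attack = c(49L, 80L),
                    Speed  = c(65L, 80L))

# One list element per row, named after the row index.
pieces <- split(stats, seq_len(nrow(stats)))
names(pieces)
class(pieces[[1]])   # each piece is still a (one-row) data frame

# Turning every piece into a plain named vector gives us "row" vectors
# similar to the ones pmap(~ c(...)) produces.
row_vectors <- lapply(pieces, unlist)
row_vectors[[1]]
```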

Again, note that we have used purrr’s flatten_int() function which turns a list into an integer vector in this case.

Moreover, we have also used purrr’s set_names() function, which is useful when you want to name the columns of a data frame. We could also have renamed the columns of poke directly:

names(poke) <- c("Name", "Type.1", "Type.2", 
                 "HP", "Attack", "Defense", 
                 "Sp_Atk", "Sp_Def", "Speed", "Legendary")

Another Example of purrr’s pmap() Function

Let’s say we want to compute the total strength of each Pokemon. We would go about it like this:

poke %>%
  dplyr::mutate(Total = purrr::pmap_int(poke %>%
                                          dplyr::select_if(is.numeric), sum)) %>%
  head(3)

##         Name Type.1 Type.2 HP Attack Defense Sp_Atk Sp_Def Speed Legendary Total
## 1  Bulbasaur  Grass Poison 45     49      49     65     65    45     False   318
## 2    Ivysaur  Grass Poison 60     62      63     80     80    60     False   405
## 3   Venusaur  Grass Poison 80     82      83    100    100    80     False   525
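As a sanity check, the same row totals can be computed with base R’s rowSums(). The sketch below uses a made-up three-column data frame whose values are copied from the first three rows above. Because it leaves out the other columns, its totals are smaller than the Total column shown.

```r
library(purrr)

# Three of the numeric columns for the first three Pokemon.
stats <- data.frame(HP      = c(45L, 60L, 80L),
                    Attack  = c(49L, 62L, 82L),
                    Defense = c(49L, 63L, 83L))

# pmap_int() calls sum() once per row with the column values as arguments.
total_pmap <- pmap_int(stats, sum)

# rowSums() is the vectorised base R equivalent (it returns doubles).
total_base <- rowSums(stats)

total_pmap
# 143 185 245
all(total_pmap == total_base)
```

For a plain sum, rowSums() is the shorter option. pmap_int() becomes interesting once the function has no vectorised row-wise counterpart.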

Rowwise Operation With pmap() on a Subset of Variables

So far, we have been selecting all the numeric variables in the data frame. However, what if you only want to use a subset of columns?

poke %>%
  dplyr::mutate(Att_Def = purrr::pmap_int(list(Attack, Defense), function(x, y) x + y)) %>%
  head(., 3)

##    Name       ...  Att_Def
## 1  Bulbasaur  ...  98
## 2  Ivysaur    ...  125
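Two small variations on this idea, sketched on a made-up data frame with the values of the first two rows: passing a subset of the data frame so that the arguments are matched by column name, and using map2_int() when exactly two columns are involved. The stats object is my own illustration.

```r
library(purrr)

# Values of the first two Pokemon, typed in by hand.
stats <- data.frame(Attack  = c(49L, 62L),
                    Defense = c(49L, 63L),
                    Speed   = c(45L, 60L))

# A data frame is a named list, so the function arguments are matched
# by column name rather than by position.
att_def <- pmap_int(stats[c("Attack", "Defense")],
                    function(Attack, Defense) Attack + Defense)

# With exactly two inputs, map2_int() does the same job.
att_def2 <- map2_int(stats$Attack, stats$Defense, ~ .x + .y)

att_def
# 98 125
```

For simple arithmetic like this, the vectorised stats$Attack + stats$Defense gives the same result. The pmap() form pays off when the function is not vectorised.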

Summary of Rowwise Operation

apply()

Pro:

  • Easy and efficient to use.

Cons:

  • Requires a bit of unconventional pipe operations.
  • Mental model requires a bit of experience (1 means row operation, 2 means column operation).

purrr and pmap()

Pro:

  • Easy to integrate with the pipe.
  • Clear mental model with purrr::pmap(~ c(...)).
  • Create functions on the fly with the lambda shorthand ~.

Cons:

  • A bit slower than apply in some cases.

Split, Apply, Combine

Pro:

  • I cannot think of anything besides doing what it is supposed to do.

Cons:

  • Slower.
  • Requires some more steps for data manipulation afterwards.
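The speed remarks above are easy to check on your own machine. The following is a rough timing sketch on simulated data. The big data frame, its size, and the use of system.time() are my assumptions, and the timings will differ between machines and package versions, so no numbers are shown.

```r
library(purrr)

# Simulated integer data: 2000 rows, 6 columns (hypothetical, not Pokemon).
set.seed(1)
big <- as.data.frame(matrix(sample(1:100, 6 * 2000, replace = TRUE), ncol = 6))

# Row-wise maximum with each of the three approaches.
t_apply <- system.time(res_apply <- apply(big, 1, max))
t_pmap  <- system.time(res_pmap  <- pmap_int(big, max))
t_split <- system.time(
  res_split <- vapply(split(big, seq_len(nrow(big))),
                      function(r) max(unlist(r)), integer(1))
)

# Compare the elapsed times and confirm that all approaches agree.
rbind(apply = t_apply, pmap = t_pmap, split = t_split)[, "elapsed"]
all(res_apply == res_pmap, res_split == res_pmap)
```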

