Vectorization, Purrr, and Mutate

March 20, 2022 By Pascal Schmidt R Tidyverse Tutorial

Sometimes R is a bit too intuitive, and the other day I was left wondering what was wrong with my code. The problem turned out to be a non-vectorized function inside a mutate statement. I usually use functions like paste and ifelse within mutate, which are already vectorized, so everything works automatically. For a specific task at work, however, I was working with a non-vectorized function, and it took me a little while to figure out what was going wrong.

So I decided to write a short post as a reminder to myself of how vectorized functions work in mutate.

Let’s start with some sample data.

library(magrittr)  # provides the %>% pipe used in the examples below

sample_df <- dplyr::tibble(
  list_col = list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f")),
  d = c(1, 2, 3, 4)
) 
sample_df

## # A tibble: 4 × 2
##   list_col      d
##   <list>    <dbl>
## 1 <chr [3]>     1
## 2 <chr [2]>     2
## 3 <chr [1]>     3
## 4 <chr [2]>     4

In the data frame above we have two columns: a list column of character vectors and a numeric (double) column. We want to get the length of the vector in each row and store it in a new column. Naively, I tried something like this…

sample_df %>% 
  dplyr::mutate(
    length_vec = length(list_col) 
  )

## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          4
## 2 <chr [2]>     2          4
## 3 <chr [1]>     3          4
## 4 <chr [2]>     4          4

For my task at work I was working with JSON data, but the example above demonstrates the problem I had. Instead of getting the length of each individual vector in list_col, I got the length of the list_col list itself, which equals the number of rows of the data frame. Now if I do …

length(sample_df$list_col)

## [1] 4

… I get a scalar (a vector of length 1) back. mutate then recycles that single value and fills the whole length_vec column with 4s.

To illustrate recycling, we can create data frames like these:

data.frame(
  a = 1, 
  b = 1:2, 
  c = 1:5,
  d = letters[1:10]
)

##    a b c d
## 1  1 1 1 a
## 2  1 2 2 b
## 3  1 1 3 c
## 4  1 2 4 d
## 5  1 1 5 e
## 6  1 2 1 f
## 7  1 1 2 g
## 8  1 2 3 h
## 9  1 1 4 i
## 10 1 2 5 j

dplyr::tibble(
  a = "letter:",
  d = letters[1:10]
)

## # A tibble: 10 × 2
##    a       d    
##    <chr>   <chr>
##  1 letter: a    
##  2 letter: b    
##  3 letter: c    
##  4 letter: d    
##  5 letter: e    
##  6 letter: f    
##  7 letter: g    
##  8 letter: h    
##  9 letter: i    
## 10 letter: j

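With data.frame, shorter vectors are recycled as long as they repeat a whole number of times to fill the longest column, which is why a, b, and c above all fit into ten rows. With tibble, the first construction would fail with an error rather than run, because tibbles only recycle values of size one. That is why the tibble example only combines a length-one value with a length-ten vector.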

That’s what basically happened to me.
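
As a minimal sketch of that difference, the snippet below builds the same two columns with data.frame and with tibble. The object names base_df and tibble_result are my own, and the error is caught with tryCatch so the script keeps running; the exact error message depends on the installed tibble version, so it is replaced with a fixed string here.

```r
# Sketch: compare recycling in data.frame() and tibble().
# base_df and tibble_result are illustrative names, not from the original post.

# data.frame() recycles b five times because 2 divides 10 evenly.
base_df <- data.frame(b = 1:2, d = letters[1:10])

# tibble() only recycles length-one values, so a length-two column
# next to a length-ten column raises an error, which we catch here.
tibble_result <- tryCatch(
  dplyr::tibble(b = 1:2, d = letters[1:10]),
  error = function(e) "tibble() raised an error"
)

nrow(base_df)
tibble_result
```

mutate follows the same strict rule as tibble: a result must have either one value per row or exactly one value, and the length-one result of length(list_col) satisfies the second case without any complaint.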

Fixing Vectorization with Purrr::map

To fix the issue, we can use purrr inside mutate to apply length to each element of the list column, which gives us the length of each vector.

sample_df %>% 
  dplyr::mutate(
    length_vec = purrr::map_int(list_col, ~ length(.)) 
  )

## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          3
## 2 <chr [2]>     2          2
## 3 <chr [1]>     3          1
## 4 <chr [2]>     4          2
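
purrr is not the only way to get there. As a sketch of two alternatives (not what I used at work), base R's lengths function is already vectorized over list elements, and dplyr::rowwise makes mutate evaluate the expression one row at a time. The names with_lengths and with_rowwise are just illustrative.

```r
library(dplyr)  # also makes the %>% pipe available

# Same sample data as above.
sample_df <- tibble(
  list_col = list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f")),
  d = c(1, 2, 3, 4)
)

# Option 1: lengths() returns the length of every list element at once.
with_lengths <- sample_df %>%
  mutate(length_vec = lengths(list_col))

# Option 2: rowwise() evaluates mutate() row by row, so list_col
# refers to a single character vector inside length().
with_rowwise <- sample_df %>%
  rowwise() %>%
  mutate(length_vec = length(list_col)) %>%
  ungroup()

with_lengths$length_vec
with_rowwise$length_vec
```

For this particular task lengths is probably the simplest choice, while rowwise is more general when the per-row function is not about length at all.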

To illustrate the problem further, consider the code below, which labels each number as "above" or "below" zero in four different ways.

  • For the first function, vec_fn_above_below, we use a for loop to vectorize it by hand.
  • The second function, fn_above_below, is vectorized with the Vectorize function in R.
  • In the mutate function, for cat_3, we use ifelse, which is vectorized by default in R.
  • For cat_4, we apply the function to each element with purrr::map_chr.

vec_fn_above_below <- function(column_name) {
  res <- base::vector(mode = 'character', length = length(column_name))
  for (i in seq_along(column_name)) {
    if(column_name[i] >= 0) {
      res[i] <- "above"
    } else {
      res[i] <- "below"
    }
  }
  return(res)
}

fn_above_below <- function(column_name) {
  if(column_name >= 0) {
    res <- "above"
  } else {
    res <- "below"
  }
  return(res)
}
fn_above_below <- base::Vectorize(fn_above_below)


df <- dplyr::tibble(
  numbers = sample(-10:10, size = 10)
)

df %>% 
  dplyr::mutate(
    cat = vec_fn_above_below(numbers),
    cat_2 = fn_above_below(numbers),
    cat_3 = ifelse(numbers >= 0, "above", "below"),
    cat_4 = purrr::map_chr(
      numbers, 
      function(x) {
        if(x >= 0) {
          res <- "above"
        } else {
          res <- "below"
        }
        return(res)
      }
    ),
    cat_5 = sum(c(identical(cat, cat_2), identical(cat_2, cat_3), identical(cat_3, cat_4))) == 3
  )

## # A tibble: 10 × 6
##    numbers cat   cat_2 cat_3 cat_4 cat_5
##      <int> <chr> <chr> <chr> <chr> <lgl>
##  1      -8 below below below below TRUE 
##  2       0 above above above above TRUE 
##  3       9 above above above above TRUE 
##  4       3 above above above above TRUE 
##  5      -2 below below below below TRUE 
##  6       7 above above above above TRUE 
##  7      -9 below below below below TRUE 
##  8     -10 below below below below TRUE 
##  9      -7 below below below below TRUE 
## 10      -3 below below below below TRUE

All four approaches give the same result, which cat_5 confirms. Note that cat_5 is itself a single TRUE that mutate recycles across all ten rows.
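
For contrast, here is a small sketch of what happens when the un-vectorized version is called on a whole vector without Vectorize or purrr. The function name plain_above_below and the object outcome are hypothetical. As far as I know, R 4.2.0 and later raise an error when the condition in if has length greater than one, while older versions only warn and use the first element, so the sketch handles both cases.

```r
# Sketch: a plain if/else only accepts a single TRUE or FALSE.
# plain_above_below is a hypothetical, un-vectorized helper.
plain_above_below <- function(x) {
  if (x >= 0) "above" else "below"
}

# Works for a single number.
plain_above_below(5)

# Fails (or warns on older R versions) for a vector of numbers.
outcome <- tryCatch(
  plain_above_below(c(-1, 2, 3)),
  error = function(e) "error",
  warning = function(w) "warning"
)
outcome
```

Either way, the function never produces one label per element, which is why it needs a loop, Vectorize, or a purrr map before it can be used safely inside mutate.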
