Multiple Ways of Doing Vectorization in R – Speeding up For Loops

Today, we will be talking about vectorization in R. This topic is especially important to R because for loops do not run very fast. So instead of using the great Rcpp library, which requires C++, we can take advantage of vectorization.
What we will be covering:
  • What is vectorization?
  •  Base R implementation of a vectorized function
  •  Vectorization with base::vectorize()
  •  Vectorization with purrr
  • Comparison of methods

What is Vectorization in R?

To give you an intuition behind vectorization and what it actually means, we can start with a simple example.
x <- c(2, 4, 6, 8, 10)
x / 2

# [1] 1 2 3 4 5

As you can see, the division happened to each element in the vector without needing a for loop. The same task from above could have been handled with a for loop as well.

for(i in seq_along(x)) {
  
  x[i] <- x[i] / 2
  
}
x

# [1] 1 2 3 4 5
A lot of functions are already vectorized in R and so you do not even have to think about the concept of vectorization a lot.
When I first heard about vectorization in R, I found it to be a very abstract term and didn’t know what was meant by it. So just to clarify, vectorization is just a function that uses a regular for loop which has been implemented in a lower language such as Fortran or C++. In short, someone else did the dirty work already and wrote a for loop for us. With vectorization we get the speed of lower level language loops.

Base R Implementation of a Vectorized Fucntion

Let’s consider the Pokemon data set.
library(tidyverse)
poke <- readr::read_csv(here::here("Pokemon.csv"))

## # A tibble: 800 x 13
##      # Name  Type 1 Type 2 Total    HP Attack Defense Sp. Atk Sp. Def
##    <dbl> <chr> <chr>    <chr>    <dbl> <dbl>  <dbl>   <dbl>     <dbl>     <dbl>
##  1     1 Bulb~ Grass    Poison     318    45     49      49        65        65
##  2     2 Ivys~ Grass    Poison     405    60     62      63        80        80
##  3     3 Venu~ Grass    Poison     525    80     82      83       100       100
##  4     3 Venu~ Grass    Poison     625    80    100     123       122       120
##  5     4 Char~ Fire     <NA>       309    39     52      43        60        50
##  6     5 Char~ Fire     <NA>       405    58     64      58        80        65
##  7     6 Char~ Fire     Flying     534    78     84      78       109        85
##  8     6 Char~ Fire     Dragon     634    78    130     111       130        85
##  9     6 Char~ Fire     Flying     634    78    104      78       159       115
## 10     7 Squi~ Water    <NA>       314    44     48      65        50        64
## # ... with 790 more rows, and 3 more variables: Speed <dbl>, Generation <dbl>,
## #   Legendary <lgl>

The iflese() function in R is vectorized and I have been using it a lot in combination with the mutate() function in R.

poke %>%
  dplyr::filter(Type 1 %in% c("Fire", "Water")) %>%
  mutate(Type 1 = base::ifelse(Type 1 == "Fire", "hot", "cold"))

## # A tibble: 164 x 13
##      # Name  Type 1 Type 2 Total    HP Attack Defense Sp. Atk Sp. Def
##    <dbl> <chr> <chr>    <chr>    <dbl> <dbl>  <dbl>   <dbl>     <dbl>     <dbl>
##  1     4 Char~ hot      <NA>       309    39     52      43        60        50
##  2     5 Char~ hot      <NA>       405    58     64      58        80        65
##  3     6 Char~ hot      Flying     534    78     84      78       109        85
##  4     6 Char~ hot      Dragon     634    78    130     111       130        85
##  5     6 Char~ hot      Flying     634    78    104      78       159       115
##  6     7 Squi~ cold     <NA>       314    44     48      65        50        64
##  7     8 Wart~ cold     <NA>       405    59     63      80        65        80
##  8     9 Blas~ cold     <NA>       530    79     83     100        85       105
##  9     9 Blas~ cold     <NA>       630    79    103     120       135       115
## 10    37 Vulp~ hot      <NA>       299    38     41      40        50        65
## # ... with 154 more rows, and 3 more variables: Speed <dbl>, Generation <dbl>,
## #   Legendary <lgl>

Every element in the column Type 1 is being modified because of vectorization. Let’s write our own ifelse() function.

Vectorization With base::Vectorize()

if_else_statement <- function(vec_element) {
  
  if(vec_element == "Fire") {
    
    vec_element = "hot"
    
  } else {
    
    vec_element = "cold"
    
  }
  
  return(vec_element)
  
}

if_else_statement(poke$Type 1[1])
## [1] "cold"

if_else_statement(poke$Type 1[1:5])
## Warning in if (vec_element == "Fire") {: the condition has length > 1 and only
## the first element will be used
## [1] "cold"

The problem with the function above is that it takes only in one element. When we put a vector into the function as an argument, then we get an error that only the first element will be used. One alternative would be to implement a for loop within the function.

if_else_statement <- function(vec) {
  
  for(i in seq_along(vec)) {
    
    if(vec[i] == "Fire") {
      
      vec[i] = "hot"
      
    } else {
      
      vec[i] = "cold"
      
    }
  }
  
  return(vec)
  
}

if_else_statement(poke$Type 1[1:5])
## [1] "cold" "cold" "cold" "cold" "hot"

Now, our functions does not throw an error and the desired operation is being applied to every element in the vector. Another way to solve our problem would be as follows:

if_else_statement <- function(vec_element) {
  
  if(vec_element == "Fire") {
    
    vec_element = "hot"
    
  } else {
    
    vec_element = "cold"
    
  }
  
  return(vec_element)
  
}

vectorized_if_else <- base::Vectorize(if_else_statement)
vectorized_if_else(poke[poke$Type 1 %in% c("Fire", "Water"), ]$Type 1[1:10])
##   Fire   Fire   Fire   Fire   Fire  Water  Water  Water  Water   Fire 
##  "hot"  "hot"  "hot"  "hot"  "hot" "cold" "cold" "cold" "cold"  "hot"

base::Vectorize() converts a scalar function to a vector function. base::Vectorize() is a base R function that vectorized our non-vectorized if_else_statement() scalar function.

Another good way to vectorize functions would be with the purrr package. If you have worked with R before then you probably know the ncol() function, which is not vectorized. However, with the purrr package we can vectorize ncol() and can make use of the C++ implementation.

Vectorization with purrr

x <- list(
  data.frame(food = c("ice cream", "pizza", "hot dog")),
  data.frame(drinks = c("pineapple juice", "beer", "lemonade")),
  data.frame(city = c("Vancouver", "Munich", "Stuttgart"))
)

# does not work as expected
x %>%
  nrow()
## NULL

# does work as expected
x %>%
  purrr::map_int(~nrow(.))
## [1] 3 3 3

Another example would be the table() function. I usually use it to count elements in a vector.

library(gapminder)
table(gapminder$continent)

## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

The table() function is not vectorized when we desire an output such as above. Hence, we can use purrr again, when we want to count the elements in multiple columns.

mtcars[, c("gear", "am", "carb")] %>%
  purrr::map(~ table(.))

## $gear
## .
##  3  4  5 
## 15 12  5 
## 
## $am
## .
##  0  1 
## 19 13 
## 
## $carb
## .
##  1  2  3  4  6  8 
##  7 10  3 10  1  1

Let’s even loop over data sets and their columns like this:

list(mtcars[, c("gear", "am", "carb")], 
     gapminder[, c("continent", "year")]) -> df

df %>%
  purrr::map(~ purrr::map(., ~table(.))) %>%
  purrr::flatten()

## $gear
## .
##  3  4  5 
## 15 12  5 
## 
## $am
## .
##  0  1 
## 19 13 
## 
## $carb
## .
##  1  2  3  4  6  8 
##  7 10  3 10  1  1 
## 
## $continent
## .
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24 
## 
## $year
## .
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
##  142  142  142  142  142  142  142  142  142  142  142  142
Vectorization is a great thing in R and can lead to a substantial decrease in execution time of your code.
Finally, let’s time our different approaches.

Comparison of Methods

nums <- seq(100000, 1000000, by = 100000)
diff_len <- vector(mode = "list", length = length(nums))

for(i in seq_along(nums)) {
  
  diff_len[[i]] <- sample(rep(c("Fire", "Water"), nums[i]), replace = TRUE)
  
}

##############################
### Base ifelse() function ###
##############################

time <- vector(mode = "list", length = length(nums))
for(i in seq_along(nums)) {
  
  system.time(
    
    base::ifelse(diff_len[[i]] == "Fire", "hot", "cold")
    
  ) -> time[[i]]

}

time %>%
  purrr::map([, 1) %>%
  purrr::flatten_dbl() %>%
  unname() -> base_vec


################
### for loop ###
################

if_else_statement <- function(vec) {
  
  for(i in seq_along(vec)) {
    
    if(vec[i] == "Fire") {
      
      vec[i] = "hot"
      
    } else {
      
      vec[i] = "cold"
      
    }
  }
  
  return(vec)
  
}

for(i in seq_along(nums)) {
  
  system.time(
    
    if_else_statement(diff_len[[i]])
    
  ) -> time[[i]]

}

time %>%
  purrr::map([, 1) %>%
  purrr::flatten_dbl() %>%
  unname() -> for_loop


###########################
### Dirty vectorization ###
###########################

if_else_statement <- function(vec_element) {
  
  if(vec_element == "Fire") {
    
    vec_element = "hot"
    
  } else {
    
    vec_element = "cold"
    
  }
  
  return(vec_element)
  
}

vectorize_if_else <- base::Vectorize(if_else_statement)

for(i in seq_along(nums)) {
  
  system.time(
    
    vectorize_if_else(diff_len[[i]])
    
  ) -> time[[i]]

}

time %>%
  purrr::map([, 1) %>%
  purrr::flatten_dbl() %>%
  unname() -> dirty_vec
# results
data.frame(base_vec = base_vec, 
           for_loop = for_loop, 
           dirty_vec = dirty_vec, 
           n = seq(100000, 1000000, by = 100000)) %>%
  tidyr::gather(base_vec:dirty_vec, key = "methods", value = "time") %>%
  dplyr::as_tibble() %>%
  ggplot(aes(x = n, y = time, col = methods)) +
  geom_point() +
  geom_line() +
  ylab("Time (sec)") +
  xlab("N")

vectorization in R, purrr

As we can see from the graph above, the base::vectorize() function does not give us any performance improvements. It’s useful if you want a quick and dirty way of making a vectorized function.
I hope this tutorial was informative and when you have any questions, let me know in the comments below. Thank you.

Post your comment