Multiple Ways of Doing Vectorization in R – Speeding up For Loops
September 21, 2019 By Pascal Schmidt programming R
Rcpp
library, which requires C++, we can take advantage of vectorization.- What is vectorization?
- Base R implementation of a vectorized function
- Vectorization with
base::vectorize()
- Vectorization with
purrr
- Comparison of methods
What is Vectorization in R?
x <- c(2, 4, 6, 8, 10) x / 2 # [1] 1 2 3 4 5
As you can see, the division happened to each element in the vector without needing a for loop. The same task from above could have been handled with a for loop as well.
for(i in seq_along(x)) { x[i] <- x[i] / 2 } x # [1] 1 2 3 4 5
Base R Implementation of a Vectorized Fucntion
library(tidyverse) poke <- readr::read_csv(here::here("Pokemon.csv")) ## # A tibble: 800 x 13 ###
NameType 1
Type 2
Total HP Attack DefenseSp. Atk
Sp. Def
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 Bulb~ Grass Poison 318 45 49 49 65 65 ## 2 2 Ivys~ Grass Poison 405 60 62 63 80 80 ## 3 3 Venu~ Grass Poison 525 80 82 83 100 100 ## 4 3 Venu~ Grass Poison 625 80 100 123 122 120 ## 5 4 Char~ Fire <NA> 309 39 52 43 60 50 ## 6 5 Char~ Fire <NA> 405 58 64 58 80 65 ## 7 6 Char~ Fire Flying 534 78 84 78 109 85 ## 8 6 Char~ Fire Dragon 634 78 130 111 130 85 ## 9 6 Char~ Fire Flying 634 78 104 78 159 115 ## 10 7 Squi~ Water <NA> 314 44 48 65 50 64 ## # ... with 790 more rows, and 3 more variables: Speed <dbl>, Generation <dbl>, ## # Legendary <lgl>
The iflese()
function in R is vectorized and I have been using it a lot in combination with the mutate()
function in R.
poke %>% dplyr::filter(Type 1
%in% c("Fire", "Water")) %>% mutate(Type 1
= base::ifelse(Type 1
== "Fire", "hot", "cold")) ## # A tibble: 164 x 13 ###
NameType 1
Type 2
Total HP Attack DefenseSp. Atk
Sp. Def
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4 Char~ hot <NA> 309 39 52 43 60 50 ## 2 5 Char~ hot <NA> 405 58 64 58 80 65 ## 3 6 Char~ hot Flying 534 78 84 78 109 85 ## 4 6 Char~ hot Dragon 634 78 130 111 130 85 ## 5 6 Char~ hot Flying 634 78 104 78 159 115 ## 6 7 Squi~ cold <NA> 314 44 48 65 50 64 ## 7 8 Wart~ cold <NA> 405 59 63 80 65 80 ## 8 9 Blas~ cold <NA> 530 79 83 100 85 105 ## 9 9 Blas~ cold <NA> 630 79 103 120 135 115 ## 10 37 Vulp~ hot <NA> 299 38 41 40 50 65 ## # ... with 154 more rows, and 3 more variables: Speed <dbl>, Generation <dbl>, ## # Legendary <lgl>
Every element in the column Type 1
is being modified because of vectorization. Let’s write our own ifelse()
function.
Vectorization With base::Vectorize()
if_else_statement <- function(vec_element) { if(vec_element == "Fire") { vec_element = "hot" } else { vec_element = "cold" } return(vec_element) } if_else_statement(poke$Type 1
[1]) ## [1] "cold" if_else_statement(poke$Type 1
[1:5]) ## Warning in if (vec_element == "Fire") {: the condition has length > 1 and only ## the first element will be used ## [1] "cold"
The problem with the function above is that it takes only in one element. When we put a vector into the function as an argument, then we get an error that only the first element will be used. One alternative would be to implement a for loop within the function.
if_else_statement <- function(vec) {
for(i in seq_along(vec)) {
if(vec[i] == "Fire") {
vec[i] = "hot"
} else {
vec[i] = "cold"
}
}
return(vec)
}
if_else_statement(poke$Type 1
[1:5])
## [1] "cold" "cold" "cold" "cold" "hot"
Now, our functions does not throw an error and the desired operation is being applied to every element in the vector. Another way to solve our problem would be as follows:
if_else_statement <- function(vec_element) { if(vec_element == "Fire") { vec_element = "hot" } else { vec_element = "cold" } return(vec_element) } vectorized_if_else <- base::Vectorize(if_else_statement) vectorized_if_else(poke[poke$Type 1
%in% c("Fire", "Water"), ]$Type 1
[1:10]) ## Fire Fire Fire Fire Fire Water Water Water Water Fire ## "hot" "hot" "hot" "hot" "hot" "cold" "cold" "cold" "cold" "hot"
base::Vectorize()
converts a scalar function to a vector function. base::Vectorize()
is a base R function that vectorized our non-vectorized if_else_statement()
scalar function.
Another good way to vectorize functions would be with the purrr
package. If you have worked with R before then you probably know the ncol()
function, which is not vectorized. However, with the purrr
package we can vectorize ncol()
and can make use of the C++ implementation.
Vectorization with purrr
x <- list( data.frame(food = c("ice cream", "pizza", "hot dog")), data.frame(drinks = c("pineapple juice", "beer", "lemonade")), data.frame(city = c("Vancouver", "Munich", "Stuttgart")) ) # does not work as expected x %>% nrow() ## NULL # does work as expected x %>% purrr::map_int(~nrow(.)) ## [1] 3 3 3
Another example would be the table()
function. I usually use it to count elements in a vector.
library(gapminder) table(gapminder$continent) ## ## Africa Americas Asia Europe Oceania ## 624 300 396 360 24
The table()
function is not vectorized when we desire an output such as above. Hence, we can use purrr
again, when we want to count the elements in multiple columns.
mtcars[, c("gear", "am", "carb")] %>% purrr::map(~ table(.)) ## $gear ## . ## 3 4 5 ## 15 12 5 ## ## $am ## . ## 0 1 ## 19 13 ## ## $carb ## . ## 1 2 3 4 6 8 ## 7 10 3 10 1 1
Let’s even loop over data sets and their columns like this:
list(mtcars[, c("gear", "am", "carb")], gapminder[, c("continent", "year")]) -> df df %>% purrr::map(~ purrr::map(., ~table(.))) %>% purrr::flatten() ## $gear ## . ## 3 4 5 ## 15 12 5 ## ## $am ## . ## 0 1 ## 19 13 ## ## $carb ## . ## 1 2 3 4 6 8 ## 7 10 3 10 1 1 ## ## $continent ## . ## Africa Americas Asia Europe Oceania ## 624 300 396 360 24 ## ## $year ## . ## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 ## 142 142 142 142 142 142 142 142 142 142 142 142
Comparison of Methods
nums <- seq(100000, 1000000, by = 100000) diff_len <- vector(mode = "list", length = length(nums)) for(i in seq_along(nums)) { diff_len[[i]] <- sample(rep(c("Fire", "Water"), nums[i]), replace = TRUE) } ############################## ### Base ifelse() function ### ############################## time <- vector(mode = "list", length = length(nums)) for(i in seq_along(nums)) { system.time( base::ifelse(diff_len[[i]] == "Fire", "hot", "cold") ) -> time[[i]] } time %>% purrr::map([
, 1) %>% purrr::flatten_dbl() %>% unname() -> base_vec ################ ### for loop ### ################ if_else_statement <- function(vec) { for(i in seq_along(vec)) { if(vec[i] == "Fire") { vec[i] = "hot" } else { vec[i] = "cold" } } return(vec) } for(i in seq_along(nums)) { system.time( if_else_statement(diff_len[[i]]) ) -> time[[i]] } time %>% purrr::map([
, 1) %>% purrr::flatten_dbl() %>% unname() -> for_loop ########################### ### Dirty vectorization ### ########################### if_else_statement <- function(vec_element) { if(vec_element == "Fire") { vec_element = "hot" } else { vec_element = "cold" } return(vec_element) } vectorize_if_else <- base::Vectorize(if_else_statement) for(i in seq_along(nums)) { system.time( vectorize_if_else(diff_len[[i]]) ) -> time[[i]] } time %>% purrr::map([
, 1) %>% purrr::flatten_dbl() %>% unname() -> dirty_vec
# results data.frame(base_vec = base_vec, for_loop = for_loop, dirty_vec = dirty_vec, n = seq(100000, 1000000, by = 100000)) %>% tidyr::gather(base_vec:dirty_vec, key = "methods", value = "time") %>% dplyr::as_tibble() %>% ggplot(aes(x = n, y = time, col = methods)) + geom_point() + geom_line() + ylab("Time (sec)") + xlab("N")
base::vectorize()
function does not give us any performance improvements. It’s useful if you want a quick and dirty way of making a vectorized function.Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications