Speeding up For Loops in R With Vectorization, Rcpp, and C++ Loops
June 15, 2019 By Pascal Schmidt programming R
Currently, I am working at Statistics Canada with administrative data, and the data sets are a lot larger than at my previous job at the BC Cancer Agency. As a result, I often run into trouble when doing data manipulation with loops. With ~2,000,000 observations, loops in R can become ridiculously slow and a real pain.
In this tutorial, I am going over one of the problems I ran into at work and how we can speed up the process.
What we will be covering:
- For loop with a data.frame
- For loop with vectors
- For loop with a matrix
- Vectorization
- C++ for loop with Rcpp
First, let’s set the seed for reproducibility and load the required libraries.
set.seed(1234)
library(tidyverse)
library(Rcpp)
Then we will be creating the toy data set with an ID column and a permit column. The goal is to create a new column which shows the combination of permits for each individual person.
Preparing ouR Toy Data Set
data.frame(ID = sample(1:50000, 100000, replace = TRUE),
           permit = sample(c("student", "work", "refugee"), 100000, replace = TRUE)) %>%
  dplyr::distinct(ID, permit) %>%
  dplyr::arrange(ID) -> df

head(df)
##   ID  permit
## 1  1 refugee
## 2  2    work
## 3  2 refugee
## 4  2 student
## 5  4    work
## 6  4 refugee
In the case of ID number 2, this person came to Canada as a refugee and also obtained a student permit and a work permit. Let’s see how we can accomplish the required task.
For Loop: Data Frame Versus Vectors
df_2 <- df
df_3 <- df
df_4 <- df
df_5 <- df

# one slot per approach to hold the timings
z <- base::vector(mode = "list", length = 5)

system.time(
  {
    df$permit <- as.character(df$permit)
    df$ID <- as.integer(df$ID)
    df$comb <- df$permit

    # i runs from 1 to nrow(df) - 1; each row is compared with the next one
    for(i in seq_len(nrow(df) - 1)) {
      if(df[i, "ID"] == df[i + 1, "ID"]) {
        df[i + 1, "comb"] <- paste0(df[i, "comb"], "_", df[i + 1, "comb"])
      }
    }

    # keep only the longest combination for each ID
    df %>%
      dplyr::mutate(len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df
  }
) -> z[[1]]

head(df)
##   ID  permit                 comb
## 1  1 student work_refugee_student
## 2  2    work refugee_student_work
## 3  3 refugee work_student_refugee
## 4  4    work refugee_student_work
## 5  5    work student_refugee_work
## 6  6 student work_refugee_student
system.time(
  {
    # pull the columns out once and loop over plain vectors
    comb <- as.character(df_2$permit)
    id <- as.integer(df_2$ID)

    for(i in seq_len(nrow(df_2) - 1)) {
      if(id[i] == id[i + 1]) {
        comb[i + 1] <- paste0(comb[i], "_", comb[i + 1])
      }
    }

    df_2 %>%
      dplyr::mutate(comb = comb,
                    permit = as.character(permit),
                    len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df_2
  }
) -> z[[2]]

dplyr::all_equal(df, df_2)
## [1] TRUE
z[c(1, 2)]
## [[1]]
##    user  system elapsed
##  155.08  147.30  331.66
##
## [[2]]
##    user  system elapsed
##    0.56    0.02    0.63
When we work with a data frame, the for loop takes a very long time to complete. Most of the inefficiency comes from constantly indexing the data.frame. Switching to plain vectors instead of indexing the columns of a data frame cuts the elapsed time from about 332 seconds to less than one second.
Next, we are doing the same thing, except with a matrix instead of a data frame. Because all types in a matrix have to be the same, we are converting everything to type character.
Speeding Up For Loops With Matrices
system.time(
  {
    df_3$comb <- df_3$permit

    # a matrix holds a single type, so everything becomes character
    df_3 %>%
      dplyr::mutate_all(as.character) %>%
      as.matrix() -> df_3

    for(i in seq_len(nrow(df_3) - 1)) {
      if(df_3[i, "ID"] == df_3[i + 1, "ID"]) {
        df_3[i + 1, "comb"] <- paste0(df_3[i, "comb"], "_", df_3[i + 1, "comb"])
      }
    }

    # convert ID to character first, then to integer
    as.data.frame(df_3) %>%
      dplyr::mutate(comb = as.character(comb),
                    ID = as.character(ID) %>% as.integer(),
                    permit = as.character(permit),
                    len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df_3
  }
) -> z[[3]]

dplyr::all_equal(df_2, df_3)
## [1] TRUE
z[[3]]
##    user  system elapsed
##    2.11    0.01    2.25
We can see that indexing a matrix is a lot more efficient than indexing a data frame. One thing I stumbled upon was converting the ID column from a factor to an integer in the right way after as.data.frame(df_3). Because R stores factors internally as integer codes, calling as.integer() directly on a factor returns those codes rather than the original values. We therefore have to convert ID to type character first before we can convert it to type integer. This is a common pitfall and has happened to me a lot. Always be aware of that issue in R.
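The factor pitfall above can be pictured with a small C++ analogy. This is only a sketch: the `Factor` struct and both function names are hypothetical stand-ins, not anything from R or Rcpp. It assumes a factor is a vector of 1-based integer codes plus a table of level labels.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an R factor: each element is stored as a
// 1-based integer code that points into a table of level labels.
struct Factor {
    std::vector<int> codes;
    std::vector<std::string> levels;
};

// Analogue of as.integer(f): hands back the internal codes, not the values.
std::vector<int> codes_as_integer(const Factor& f) {
    return f.codes;
}

// Analogue of as.integer(as.character(f)): look up each label, then parse it.
std::vector<int> labels_as_integer(const Factor& f) {
    std::vector<int> out;
    out.reserve(f.codes.size());
    for (int code : f.codes) {
        assert(code >= 1 && static_cast<std::size_t>(code) <= f.levels.size());
        out.push_back(std::stoi(f.levels[code - 1]));
    }
    return out;
}
```

For the IDs 1, 2, 2, 4 from the toy data, the levels would be "1", "2", "4" and the codes 1, 2, 2, 3. Reading the codes directly turns ID 4 into 3, while going through the labels recovers 4.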
Vectorization Instead of Loops
In the code below, we are vectorizing our operations in the pipe with ifelse() and paste0(). Both of these functions are vectorized, which makes our code very efficient and fast.
For vectorizing operations, I found the dplyr function lag() to be very helpful. Let’s check the speed of the operations below.
system.time(
  df_4 %>%
    # lagged copies of ID and permit give access to the previous two rows
    dplyr::mutate(ID_comb = dplyr::lag(ID),
                  ID_comb_2 = dplyr::lag(ID, 2),
                  permit_comb = dplyr::lag(permit),
                  permit_comb_2 = dplyr::lag(permit, 2),
                  comb = base::ifelse(ID == ID_comb,
                                      base::paste0(permit, "_", permit_comb),
                                      as.character(permit)),
                  comb = base::ifelse(ID == ID_comb_2,
                                      base::paste0(permit, "_", permit_comb, "_", permit_comb_2),
                                      comb),
                  len = nchar(comb)) %>%
    dplyr::arrange(ID, desc(len)) %>%
    # the first row has no lagged values (NA), so its combination is set by hand
    dplyr::mutate(comb = c("refugee", .$comb[-1])) %>%
    dplyr::distinct(ID, .keep_all = TRUE) %>%
    dplyr::select(-c(ID_comb, ID_comb_2, permit_comb, permit_comb_2, len)) %>%
    dplyr::mutate(permit = as.character(permit),
                  comb = as.character(comb)) -> df_4
) -> z[[4]]

z[[4]]
##    user  system elapsed
##    0.56    0.05    0.61
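Vectorization is very fast and efficient. We are lucky that ifelse() and paste0() are vectorized. Otherwise, we could use the base::Vectorize function to wrap non-vectorized functions, although Vectorize() is a wrapper around mapply() and mainly provides a vectorized interface rather than the same speed. Note that the two lags only look back two rows, so this approach handles at most three permits per ID. That is enough here because there are only three permit types.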
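To make the lag idea concrete, here is a minimal C++ sketch of the same logic. The names `lag_by` and `combine_with_lags` are hypothetical, and the sketch assumes that real IDs are positive, so -1 can stand in for the missing value that dplyr's lag() would produce at the start of a column. As in the pipe above, the current row's permit comes first in the combined string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mirroring the idea of dplyr::lag(): shift the values
// down by n positions and pad the front with a fill value.
template <typename T>
std::vector<T> lag_by(const std::vector<T>& x, std::size_t n, const T& fill) {
    std::vector<T> out(x.size(), fill);
    for (std::size_t i = n; i < x.size(); ++i) {
        out[i] = x[i - n];
    }
    return out;
}

// Builds the combination for every row from the current row and its two
// predecessors. Rows are assumed to be sorted by id.
std::vector<std::string> combine_with_lags(const std::vector<std::string>& permit,
                                           const std::vector<int>& id) {
    assert(permit.size() == id.size());
    const int no_id = -1;  // assumption: real IDs are positive
    const std::vector<int> id_1 = lag_by(id, 1, no_id);
    const std::vector<int> id_2 = lag_by(id, 2, no_id);
    const std::vector<std::string> permit_1 = lag_by(permit, 1, std::string());
    const std::vector<std::string> permit_2 = lag_by(permit, 2, std::string());

    std::vector<std::string> comb(permit);
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (id[i] == id_2[i]) {
            // same ID two rows back: three permits belong together
            comb[i] = permit[i] + "_" + permit_1[i] + "_" + permit_2[i];
        } else if (id[i] == id_1[i]) {
            // same ID one row back: two permits belong together
            comb[i] = permit[i] + "_" + permit_1[i];
        }
    }
    return comb;
}
```

Each row only depends on fixed offsets into the input, never on a result computed for an earlier row. That independence is what allows the R version to be written with whole-column operations.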
Speeding up For Loops in R with C++ and Rcpp
After making some improvements to our for loop in R, we are unleashing the speed of C++ for loops. This is the fastest approach in this comparison: in user time, it is roughly 3.7 times faster than even the vectorized version.
Rcpp::sourceCpp(here::here("c-plus-plus-fun.cpp"))

system.time(
  {
    df_5 %>%
      dplyr::pull(permit) -> chr_permit
    df_5 %>%
      dplyr::pull(ID) -> num_id
    combined_permits(chr_permit, num_id) -> df_cpp
  }
) -> z[[5]]

purrr::set_names(z, c("Data Frame", "Vector", "Matrix", "Vectorization", "C++")) %>%
  base::do.call(rbind, .) %>%
  base::as.data.frame() %>%
  tibble::rownames_to_column(var = "type") %>%
  dplyr::arrange(user.self) %>%
  dplyr::select(-c(user.child, sys.child)) %>%
  dplyr::mutate(prop_speed = round(user.self / .[1, "user.self"], 2))

           type user.self sys.self elapsed prop_speed
1           C++      0.15     0.00    0.15       1.00
2        Vector      0.56     0.02    0.63       3.73
3 Vectorization      0.56     0.05    0.61       3.73
4        Matrix      2.11     0.01    2.25      14.07
5    Data Frame    155.08   147.30  331.66    1033.87
We can see that the C++ for loop is the fastest of the five approaches. It might take a while to get used to writing C++ code. However, the time you save by not waiting for your for loops to finish is well worth it.
C++ For Loop
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::List combined_permits(Rcpp::CharacterVector permit, Rcpp::NumericVector id) {
  int size = id.length();
  Rcpp::String x("_");

  // compare each row with the next one; rows must be sorted by id
  for(int i = 0; i < size - 1; i++) {
    if(id[i] == id[i + 1]) {
      // build "<row i>_<row i + 1>" and store it in row i + 1
      permit[i + 1] = (permit[i] += x) += permit[i + 1];
    }
  }

  // wrap the result in a one-column data frame
  List df = List::create(_["comb"] = permit);
  df.attr("class") = "data.frame";
  df.attr("row.names") = seq(1, size);

  return df;
}
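If you want to check the loop logic without going through R, the same adjacent-row idea can be sketched in plain C++. The function name `combine_permits_std` is hypothetical and not part of the file sourced above. It assumes the rows are already sorted by ID, and it takes the permits by value so the caller's vector stays unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone helper: the adjacent-row loop on std::vector, so it
// can be compiled and tested without R. permit is copied on purpose.
std::vector<std::string> combine_permits_std(std::vector<std::string> permit,
                                             const std::vector<int>& id) {
    assert(permit.size() == id.size());
    // Rows must already be sorted by id, as in the R code (dplyr::arrange(ID)).
    for (std::size_t i = 0; i + 1 < id.size(); ++i) {
        if (id[i] == id[i + 1]) {
            // Row i + 1 accumulates everything seen so far for this id.
            permit[i + 1] = permit[i] + "_" + permit[i + 1];
        }
    }
    return permit;
}
```

With the first six rows of the toy data (IDs 1, 2, 2, 2, 4, 4), the last row of ID 2 ends up as "work_refugee_student". That longest string is the row the R pipelines keep for each ID.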
If you have any questions or suggestions, please let me know in the comments below. Thank you!