Speeding up For Loops in R With Vectorization, Rcpp, and C++ Loops

I currently work with administrative data at Statistics Canada, where the data sets are much larger than those at my previous job at the BC Cancer Agency. As a result, I often run into trouble when manipulating data with loops. With ~2,000,000 observations, loops in R can become painfully slow.

In this tutorial, I go over one of the problems I ran into at work and show how we can speed up the process.

What we will be covering:

  • For loop with data.frame
  • For loop with vectors
  • For loop with a matrix
  • Vectorization
  • C++ for loop with Rcpp

First, let’s set the seed for reproducibility and load the required libraries.

set.seed(1234)
library(tidyverse)
library(Rcpp)

Then we will be creating the toy data set with an ID column and a permit column. The goal is to create a new column which shows the combination of permits for each individual person.

Preparing ouR Toy Data Set

data.frame(ID = sample(1:50000, 100000, replace = TRUE),
           permit = sample(c("student", "work", "refugee"), 100000, replace = TRUE)) %>%
  dplyr::distinct(ID, permit) %>%
  dplyr::arrange(ID) -> df

head(df)

##   ID  permit
## 1  1 refugee
## 2  2    work
## 3  2 refugee
## 4  2 student
## 5  4    work
## 6  4 refugee

In the case of ID number 2, this person holds all three permit types, for example someone who came to Canada as a refugee and later obtained a student and a work permit. Let’s see how we can accomplish the required task.
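To make the target concrete, here is a minimal sketch on a hand-typed copy of the six rows shown above. It uses base R's tapply() purely to illustrate what the final result should look like (one row per ID with all permits pasted together); it is not one of the loop-based approaches timed in this post, and the names toy and target are made up for the illustration.

```r
# Hand-typed copy of the head(df) output shown above (illustrative only).
toy <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 4L, 4L),
  permit = c("refugee", "work", "refugee", "student", "work", "refugee"),
  stringsAsFactors = FALSE
)

# Paste all permits of one ID together, separated by "_".
combined <- tapply(toy$permit, toy$ID, paste, collapse = "_")

# Turn the named result back into a data frame with one row per ID.
target <- data.frame(
  ID = as.integer(names(combined)),
  comb = as.vector(combined),
  stringsAsFactors = FALSE
)

print(target)
```

The loops in the following sections build the same kind of comb string step by step, row by row.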

For Loop: Data Frame Versus Vectors

df_2 <- df
df_3 <- df
df_4 <- df
df_5 <- df
z <- base::vector(mode = "list", length = 5)
system.time(
  {
    df$permit <- as.character(df$permit)
    df$ID <- as.integer(df$ID)
    df$comb <- df$permit
    
    for(i in 2:nrow(df) - 1) {
      if(df[i, "ID"] == df[i + 1, "ID"]) {
        df[i + 1, "comb"] <- paste0(df[i, "comb"], "_", df[i + 1, "comb"])
      }
    }
    
    df %>%
      dplyr::mutate(comb = comb,
                    len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df

  }
) -> z[[1]]

head(df)

##   ID  permit                 comb
## 1  1 student work_refugee_student
## 2  2    work refugee_student_work
## 3  3 refugee work_student_refugee
## 4  4    work refugee_student_work
## 5  5    work student_refugee_work
## 6  6 student work_refugee_student

Next, the same loop, this time operating on plain vectors pulled out of the data frame:

system.time(
  {
    
    comb <- as.character(df_2$permit)
    id <- as.integer(df_2$ID)
    
    for(i in 2:nrow(df_2) - 1) {
      if(id[i] == id[i + 1]) {
        comb[i + 1] <- paste0(comb[i], "_", comb[i + 1])
      } 
    }
    
    df_2 %>%
      dplyr::mutate(comb = comb,
                    permit = as.character(permit),
                    len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df_2
    
  }
) -> z[[2]]

dplyr::all_equal(df, df_2)

## [1] TRUE

z[c(1, 2)]

## [[1]]
##    user  system elapsed 
##  155.08  147.30  331.66 
## 
## [[2]]
##    user  system elapsed 
##    0.56    0.02    0.63

When we work with a data frame, the for loop takes a very long time to complete: 331.66 seconds elapsed, compared with 0.63 seconds for the vector version, a difference of roughly 500-fold. Most of the inefficiency comes from indexing and assigning into the data.frame on every iteration. Looping over plain vectors instead of the columns of a data frame removes that overhead.
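The difference can be reproduced on a much smaller scale. The sketch below is only an illustration on made-up data (the names toy, idx, comb_df, id and comb are assumptions, and permits are not de-duplicated per ID). It also spells out the loop bound: because `:` binds tighter than `-`, the expression 2:n - 1 means (2:n) - 1, that is 1, 2, ..., n - 1, which is the same set of indices as seq_len(n - 1).

```r
set.seed(1)
n <- 2000
toy <- data.frame(
  ID = sort(sample(1:800, n, replace = TRUE)),
  permit = sample(c("student", "work", "refugee"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# `:` has higher precedence than `-`, so 2:n - 1 is (2:n) - 1.
idx <- seq_len(n - 1)
same_idx <- all(idx == 2:n - 1)

# Version 1: index the data frame on every iteration.
comb_df <- toy
comb_df$comb <- comb_df$permit
time_df <- system.time(
  for (i in idx) {
    if (comb_df[i, "ID"] == comb_df[i + 1, "ID"]) {
      comb_df[i + 1, "comb"] <- paste0(comb_df[i, "comb"], "_", comb_df[i + 1, "comb"])
    }
  }
)

# Version 2: pull the columns out once and loop over plain vectors.
id <- toy$ID
comb <- toy$permit
time_vec <- system.time(
  for (i in idx) {
    if (id[i] == id[i + 1]) {
      comb[i + 1] <- paste0(comb[i], "_", comb[i + 1])
    }
  }
)

# Both versions produce the same strings; only the timings differ.
same_result <- identical(comb_df$comb, comb)
print(rbind(data_frame = time_df, vector = time_vec)[, 1:3])
```

The exact timings depend on the machine, so none are quoted here. The point is only that both loops do the same work and differ in how each element is accessed.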

Next, we are doing the same thing, except with a matrix instead of a data frame. Because all types in a matrix have to be the same, we are converting everything to type character.

Speeding Up For Loops With Matrices

system.time(
  {
    
    df_3$comb <- df_3$permit
    df_3 %>%
      dplyr::mutate_all(as.character) %>%
      as.matrix() -> df_3
    
    for(i in 2:nrow(df_3) - 1) {
      if(df_3[i, "ID"] == df_3[i + 1, "ID"]) {
        df_3[i + 1, "comb"] <- paste0(df_3[i, "comb"], "_", df_3[i + 1, "comb"])
      }
    }
    
    as.data.frame(df_3) %>%
      dplyr::mutate(comb = as.character(comb),
                    ID = as.character(ID) %>%
                      as.integer(),
                    permit = as.character(permit),
                    len = nchar(comb)) %>%
      dplyr::arrange(ID, desc(len)) %>%
      dplyr::distinct(ID, .keep_all = TRUE) %>%
      dplyr::select(-len) -> df_3

  }
) -> z[[3]]

dplyr::all_equal(df_2, df_3)

## [1] TRUE
z[[3]]

##    user  system elapsed 
##    2.11    0.01    2.25

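We can see that indexing a matrix is a lot more efficient than indexing a data frame, although it is still slower than looping over plain vectors (2.25 versus 0.63 seconds elapsed). One thing I stumbled over was converting the ID column from a factor to an integer correctly after as.data.frame(df_3). Because a factor is stored internally as integer codes, calling as.integer() on it directly returns those codes rather than the original IDs, so we have to convert ID to type character first and only then to type integer. (Since R 4.0.0, strings are no longer converted to factors by default, so on current versions the extra as.character() step is harmless but only strictly needed on older ones.) This is a common pitfall and has happened to me a lot. Always be aware of that issue in R.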
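The factor pitfall is easy to see on a tiny example. This is a minimal sketch with made-up ID values; the variable names are only for illustration.

```r
# IDs stored as a factor, as could happen after converting a character matrix.
id_factor <- factor(c("10", "25", "10", "300"))

# Wrong: returns the internal level codes, not the IDs.
wrong <- as.integer(id_factor)

# Right: go through character first.
right <- as.integer(as.character(id_factor))

# Equivalent alternative: convert the levels once, then index by the factor.
right_too <- as.integer(levels(id_factor))[id_factor]

print(wrong)
print(right)
```

Here the direct conversion yields the codes 1, 2, 1, 3, while the two-step conversion recovers 10, 25, 10, 300.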

Vectorization Instead of Loops

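In the code below, we are vectorizing our operations in the pipe with ifelse() and paste0(). Both of these functions are vectorized, so they operate on whole columns at once instead of one row at a time.

For vectorizing operations, I found the dplyr function lag() to be very helpful: it shifts a column down so that each row can be compared with the row before it. Two lags are enough here because, with three permit types and distinct ID/permit pairs, each ID has at most three rows. Let’s check the speed of the operations below.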
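To show what lag() does in isolation, here is a minimal sketch on a made-up five-row data frame. It uses a single lag and guards the first row with !is.na(), which is an assumption of this sketch rather than what the code in this post does; the column names prev_id, prev_permit and same_person are likewise made up.

```r
library(dplyr)

toy <- data.frame(
  ID = c(1L, 2L, 2L, 4L, 4L),
  permit = c("refugee", "work", "student", "work", "refugee"),
  stringsAsFactors = FALSE
)

toy_lagged <- toy %>%
  mutate(
    prev_id = lag(ID),            # ID of the previous row, NA for the first row
    prev_permit = lag(permit),    # permit of the previous row
    # TRUE only when a previous row exists and belongs to the same person
    same_person = !is.na(prev_id) & ID == prev_id,
    # whole-column ifelse(): combine with the previous permit where applicable
    comb = ifelse(same_person, paste0(prev_permit, "_", permit), permit)
  )

print(toy_lagged)
```

Every step works on the full column, so there is no explicit loop. With more rows per ID, one lag per additional row would be needed.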

system.time(
  
  df_4 %>%
    dplyr::mutate(ID_comb = dplyr::lag(ID),
                  ID_comb_2 = dplyr::lag(ID, 2),
                  permit_comb = dplyr::lag(permit),
                  permit_comb_2 = dplyr::lag(permit, 2),
                  comb = base::ifelse(ID == ID_comb, base::paste0(permit, "_", permit_comb), as.character(permit)),
                  comb = base::ifelse(ID == ID_comb_2, base::paste0(permit, "_", permit_comb, "_", permit_comb_2), comb),
                  len = nchar(comb)) %>%
    dplyr::arrange(ID, desc(len)) %>%
    dplyr::mutate(comb = c("refugee", .$comb[-1])) %>%
    dplyr::distinct(ID, .keep_all = TRUE) %>%
    dplyr::select(-c(ID_comb, ID_comb_2, permit_comb, permit_comb_2, len)) %>%
    dplyr::mutate(permit = as.character(permit),
                  comb = as.character(comb)) -> df_4
  
) -> z[[4]]

z[[4]]

##    user  system elapsed 
##    0.56    0.05    0.61

Vectorization is very fast and efficient: at 0.61 seconds elapsed it is on par with the vector loop. We are lucky that ifelse() and paste0() are vectorized. Otherwise, we could use the base::Vectorize function to vectorize non-vectorized functions.
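As a minimal sketch of the base::Vectorize idea, the hypothetical helper describe_pair() below only works on single values because it uses if/else. Wrapping it with Vectorize() lets it accept whole vectors. Note that Vectorize() is a wrapper around mapply(), so it is a convenience for writing vector-style code rather than a speed-up in itself.

```r
# A scalar-only function: `if` needs a single TRUE/FALSE.
describe_pair <- function(id, prev_id) {
  if (is.na(prev_id)) {
    "first row"
  } else if (id == prev_id) {
    "same person"
  } else {
    "new person"
  }
}

# Vectorize() returns a new function that applies describe_pair() element-wise.
describe_pairs <- Vectorize(describe_pair, USE.NAMES = FALSE)

id <- c(1L, 2L, 2L, 4L)
prev_id <- c(NA, 1L, 2L, 2L)

labels <- describe_pairs(id, prev_id)
print(labels)
```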

Speeding up For Loops in R with C++ and Rcpp

After we have made some improvements to our for loop, we now move the loop itself to C++ with Rcpp. In this benchmark it is the fastest option: 0.15 seconds elapsed, compared with 0.61 seconds for the vectorized version, so roughly four times faster.

Rcpp::sourceCpp(here::here("c-plus-plus-fun.cpp"))
system.time(
  {
    df_5 %>%
      dplyr::pull(permit) -> chr_permit
    df_5 %>%
      dplyr::pull(ID) -> num_id
    
    combined_permits(chr_permit, num_id) -> df_cpp
  }
) -> z[[5]]
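Judging from its source, combined_permits() returns a one-column data frame with a comb value for every input row, so a final step is still needed to get one row per ID as in the earlier approaches. The following is a minimal sketch of one way to do that in base R, on made-up vectors standing in for the IDs and the returned comb column. It assumes the rows are sorted by ID, so that the last row of each ID carries the full combination.

```r
# Stand-ins for the sorted IDs and the row-wise combined strings.
id <- c(1L, 2L, 2L, 2L, 4L, 4L)
comb <- c("refugee",
          "work", "work_refugee", "work_refugee_student",
          "work", "work_refugee")

# Keep only the last row of each ID, which holds the complete combination.
keep <- !duplicated(id, fromLast = TRUE)

result <- data.frame(ID = id[keep], comb = comb[keep], stringsAsFactors = FALSE)
print(result)
```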


purrr::set_names(z, c("Data Frame", "Vector", "Matrix", "Vectorization", "C++")) %>%
  base::do.call(rbind, .) %>%
  base::as.data.frame() %>%
  tibble::rownames_to_column(var = "type") %>%
  dplyr::arrange(user.self) %>%
  dplyr::select(-c(user.child, sys.child)) %>%
  dplyr::mutate(prop_speed = round(user.self / .[1, "user.self"], 2))

##            type user.self sys.self elapsed prop_speed
## 1           C++      0.15     0.00    0.15       1.00
## 2        Vector      0.56     0.02    0.63       3.73
## 3 Vectorization      0.56     0.05    0.61       3.73
## 4        Matrix      2.11     0.01    2.25      14.07
## 5    Data Frame    155.08   147.30  331.66    1033.87

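We can see that the C++ for loop is the fastest of the five approaches in this benchmark: by user time it is about 4 times faster than the vector loop and the vectorized version, about 14 times faster than the matrix loop, and about 1,000 times faster than the data frame loop. It might take a while to get used to writing C++ code. However, the time saved while waiting for your for loops to finish is well worth it.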
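The prop_speed column is based on user time. As a small worked check, the sketch below recomputes it from the timings reported above and adds the same ratio for elapsed time, which also includes system time and waiting. The data frame is typed in by hand from the table, and the column names prop_user and prop_elapsed are made up for the illustration.

```r
# Timings copied by hand from the results table above (seconds).
timings <- data.frame(
  type = c("C++", "Vector", "Vectorization", "Matrix", "Data Frame"),
  user = c(0.15, 0.56, 0.56, 2.11, 155.08),
  elapsed = c(0.15, 0.63, 0.61, 2.25, 331.66),
  stringsAsFactors = FALSE
)

# Ratio of each approach to the fastest one (first row), rounded as in the post.
timings$prop_user <- round(timings$user / timings$user[1], 2)
timings$prop_elapsed <- round(timings$elapsed / timings$elapsed[1], 2)

print(timings)
```

Measured by elapsed time, the gap to the data frame loop is even larger than by user time, because that loop also spends a lot of system time.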

C++ For Loop

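#include <Rcpp.h>
#include <string>
using namespace Rcpp;

// Combine the permits of consecutive rows that share the same ID.
// Assumes the rows are already sorted by ID.
// [[Rcpp::export]]
Rcpp::List combined_permits(Rcpp::CharacterVector permit,
                            Rcpp::NumericVector id) {

  int size = id.length();

  // Work on a deep copy: an Rcpp vector shares memory with the R object,
  // so writing into `permit` directly would also change the caller's vector.
  Rcpp::CharacterVector comb = Rcpp::clone(permit);

  for (int i = 0; i < size - 1; i++) {
    if (id[i] == id[i + 1]) {
      // Build "<previous>_<current>" without modifying the previous element.
      std::string prev = Rcpp::as<std::string>(comb[i]);
      std::string curr = Rcpp::as<std::string>(comb[i + 1]);
      comb[i + 1] = prev + "_" + curr;
    }
  }

  // Return a one-column data frame with one combined string per input row.
  List df = List::create(_["comb"] = comb);
  df.attr("class") = "data.frame";
  df.attr("row.names") = seq(1, size);
  return df;

}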
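For quick experiments, the same kind of loop can also be compiled inline from R with Rcpp::cppFunction() instead of a separate .cpp file. The sketch below is a simplified variant under assumptions: the function name combine_sorted is made up, it returns a character vector rather than a data frame, and it needs a working C++ toolchain just like Rcpp::sourceCpp(). It copies the input with clone() so that the R vector passed in is left untouched.

```r
library(Rcpp)

# Compile a small C++ function inline; cppFunction() adds the Rcpp header
# and `using namespace Rcpp;` automatically.
Rcpp::cppFunction('
CharacterVector combine_sorted(CharacterVector permit, IntegerVector id) {
  CharacterVector comb = clone(permit);   // deep copy, input stays unchanged
  int n = id.size();
  for (int i = 0; i < n - 1; i++) {
    if (id[i] == id[i + 1]) {
      std::string prev = as<std::string>(comb[i]);
      std::string curr = as<std::string>(comb[i + 1]);
      comb[i + 1] = prev + "_" + curr;
    }
  }
  return comb;
}')

permit <- c("refugee", "work", "refugee", "student")
id <- c(1L, 2L, 2L, 2L)

res <- combine_sorted(permit, id)
print(res)
print(permit)  # the original vector is not modified
```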

If you have any questions or suggestions, please let me know in the comments below. Thank you!