Speed up tidyverse analysis with dtplyr

March 21, 2023

I’ve got a dataset with ~15 million rows that I need to clean. I’m a big tidyverse fan, but on data this size dplyr is noticeably slower than data.table.

Well, TIL about dtplyr, which lets you write dplyr code that gets translated into data.table code behind the scenes, so you get most of data.table's speed:

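library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap the data in a lazy data.table; nothing is computed yet.
# immutable = FALSE lets dtplyr modify `data` in place instead of copying it.
data_lazy <- data %>%
  lazy_dt(immutable = FALSE)

# The dplyr verbs are only recorded here; as_tibble() runs the
# translated data.table code and returns the result.
data_lazy %>%
  mutate(...) %>%
  group_by(column) %>%
  summarize(...) %>%
  as_tibble()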
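Here's a small runnable sketch of the same pattern, in case you want to try it without a 15-million-row dataset. The `sales` table and its `region` and `amount` columns are made up for illustration; `show_query()` prints the data.table code that dtplyr generates, which is a handy way to see what will actually run.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the real dataset
set.seed(1)
sales <- data.table(
  region = sample(c("north", "south", "east"), 1000, replace = TRUE),
  amount = runif(1000, 0, 100)
)

# Default immutable = TRUE: the original `sales` table is left alone
sales_lazy <- lazy_dt(sales)

# Build the pipeline; nothing is computed at this point
query <- sales_lazy %>%
  mutate(amount_k = amount / 1000) %>%
  group_by(region) %>%
  summarize(total_k = sum(amount_k), n = n())

# Print the data.table translation of the pipeline
show_query(query)

# Run the translated code and collect the result as a tibble
result <- as_tibble(query)
```

You can also finish with `as.data.table()` or `as.data.frame()` instead of `as_tibble()` if you'd rather keep working in another format.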

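Take a look at the immutable argument in the docs. It defaults to TRUE, which means dtplyr never modifies the input and has to copy it before something like mutate(). Setting immutable = FALSE skips that copy and lets dtplyr modify the input in place, so only use it if you don't need the original left untouched. This runs soooo much faster than plain dplyr.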
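To see what the immutable argument changes, here's a tiny sketch with two made-up tables (`safe_dt` and `fast_dt` are hypothetical names). My understanding is that with the default the generated code wraps the table in `copy()`, and with `immutable = FALSE` it uses `:=` directly on the original, so the original should pick up the new column. As far as I can tell, `immutable = FALSE` also expects the input to already be a data.table, which is why both tables are built with `data.table()` here.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Two identical toy tables (hypothetical data)
safe_dt <- data.table(x = 1:3)
fast_dt <- data.table(x = 1:3)

# Same pipeline, with and without the defensive copy
safe <- lazy_dt(safe_dt) %>% mutate(y = x * 2)
fast <- lazy_dt(fast_dt, immutable = FALSE) %>% mutate(y = x * 2)

# Compare the generated data.table code
show_query(safe)
show_query(fast)

# Run both pipelines
safe_result <- as_tibble(safe)
fast_result <- as_tibble(fast)

# Check which of the original tables was modified
names(safe_dt)
names(fast_dt)
```

If `fast_dt` shows the extra `y` column afterwards, that's the in-place modification at work, and it's the trade-off you accept for skipping the copy.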

Pair that with a previous TIL about caching R code. Boom.