Speed up tidyverse analysis with dtplyr
March 21, 2023
I’ve got a dataset of ~15 million rows that I need to clean. I’m a big tidyverse fan, but dplyr is slower than data.table.
Well, TIL about dtplyr, which lets you write dplyr code but gain the speed of data.table:
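library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap the data so dplyr verbs get translated to data.table code.
# immutable = FALSE lets dtplyr modify `data` in place instead of copying it.
data_lazy <- data %>%
  lazy_dt(immutable = FALSE)

# Nothing is computed until as_tibble() collects the result.
data_lazy %>%
  mutate(...) %>%
  group_by(column) %>%
  summarize(...) %>%
  as_tibble()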
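To see what dtplyr does with a pipeline like this, here is a minimal sketch on a made-up four-row table. The `toy` data and its `group` and `value` columns are placeholders of mine, not the real dataset. show_query() prints the data.table code that dtplyr generates, and nothing is actually computed until as_tibble() is called.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for the real ~15 million row dataset
toy <- data.table(
  group = c("a", "b", "a", "b"),
  value = c(1, 2, 3, 4)
)

# Build the pipeline lazily; this only records the steps
pipeline <- toy %>%
  lazy_dt() %>%
  mutate(doubled = value * 2) %>%
  group_by(group) %>%
  summarize(total = sum(doubled))

# Print the data.table translation of the dplyr verbs
show_query(pipeline)

# Run the translated code and bring the result back as a tibble
result <- as_tibble(pipeline)
print(result)
```

This sketch uses the default immutable = TRUE, so `toy` itself is left untouched.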
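Take a look at the immutable argument in the docs: with immutable = FALSE, dtplyr is allowed to modify the input in place instead of copying it first, which saves time and memory on a dataset this size but means the original object can change. This runs soooo much faster.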
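Here is a small sketch of what the immutable argument changes, again on a made-up table (`toy` and its columns are hypothetical). It assumes the input is already a data.table; as far as I can tell, lazy_dt() only accepts immutable = FALSE in that case.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data; it must already be a data.table for immutable = FALSE
toy <- data.table(group = c("a", "a", "b"), value = c(1, 2, 3))

# immutable = TRUE (the default): dtplyr copies before mutating
safe_result <- toy %>%
  lazy_dt(immutable = TRUE) %>%
  mutate(doubled = value * 2) %>%
  as_tibble()
unchanged_after_safe <- !("doubled" %in% names(toy))

# immutable = FALSE: mutate() may add the column to `toy` by reference
fast_result <- toy %>%
  lazy_dt(immutable = FALSE) %>%
  mutate(doubled = value * 2) %>%
  as_tibble()
changed_after_fast <- "doubled" %in% names(toy)

print(unchanged_after_safe)
print(changed_after_fast)
```

The trade-off is the usual one with data.table: skipping the copy is where the speed comes from, but anything else still holding on to the original table sees the modification too.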
Pair that with a previous TIL about caching R code. Boom.