Speed up tidyverse analysis with dtplyr
I’ve got a ~15-million-row dataset that I need to clean. I’m a big tidyverse fan, but dplyr is noticeably slower than data.table at this scale.
Well, TIL about dtplyr, which lets you write dplyr code while getting the speed of data.table:
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap the data frame in a lazy data.table
data_lazy <- data %>%
  lazy_dt(immutable = FALSE)

# Ordinary dplyr verbs; dtplyr translates them to data.table and
# nothing is evaluated until you collect with as_tibble()
data_lazy %>%
  mutate(...) %>%
  group_by(column) %>%
  summarize(...) %>%
  as_tibble()
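
As an aside, if you want to peek at what dtplyr is doing under the hood, you can call show_query() on the lazy pipeline before collecting it. A minimal sketch with a toy data frame (the column and value names here are made up for illustration):

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for the real 15-million-row dataset
toy <- data.frame(column = c("a", "a", "b"), value = c(1, 2, 3))

toy %>%
  lazy_dt() %>%
  group_by(column) %>%
  summarize(total = sum(value)) %>%
  show_query()  # prints the translated data.table expression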
Take a look at the immutable argument in the docs: with immutable = FALSE, dtplyr is allowed to modify the underlying data.table in place instead of copying it first, which saves a lot of time and memory on a table this big. This runs soooo much faster.
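
How much faster depends on your data and the verbs you use, so it’s worth timing it yourself. A rough sketch on simulated data (the 1,000 groups and the mean summary are made-up stand-ins for the real cleaning steps):

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Simulated stand-in: 15 million rows split across 1,000 groups
n <- 15e6
sim <- data.frame(column = sample(1000, n, replace = TRUE),
                  value  = runif(n))

# Plain dplyr
system.time(
  sim %>% group_by(column) %>% summarize(mean_value = mean(value))
)

# Same verbs through dtplyr / data.table
sim_dt <- as.data.table(sim)
system.time(
  sim_dt %>%
    lazy_dt(immutable = FALSE) %>%
    group_by(column) %>%
    summarize(mean_value = mean(value)) %>%
    as_tibble()
)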
Pair that with a previous TIL about caching R code. Boom.
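
That caching TIL isn’t reproduced here, but the gist is to do the expensive cleaning once, save the result to disk, and read it back on later runs. A minimal sketch of that idea using saveRDS()/readRDS() (cache_rds(), the file path, and clean_data() are hypothetical names, not from the original post):

# Hypothetical helper: evaluate an expensive expression only on a cache miss.
# R's lazy argument evaluation means `expr` never runs if the file exists.
cache_rds <- function(path, expr) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    result <- expr
    saveRDS(result, path)
    result
  }
}

# e.g. cleaned <- cache_rds("cleaned_data.rds", clean_data(data))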