collapse might rock your world

collapse is a new force in the data-transformation stakes. In this post, I try out one simple function and compare it to data.table and dplyr.
The NHS-R Community site prompted this post, specifically a question to the help-with-r channel on Slack.
Someone asked about obtaining unique values in a 60M row / 6 column dataset.
Naturally, data.table came up, and a few of us made suggestions for improvement, but we didn’t seem to be hitting on a magic solution.
Without being able to get my hands on the actual data, I decided to have a play around at home.
I don’t know what the real-life data structure was, but this is the simple 4-column dataset I produced initially, named testdata:
str(testdata)
# Classes ‘data.table’ and 'data.frame': 60000000 obs. of 4 variables:
# $ RANDOM_STRING: chr "aaaaa" "aaaaa" "aaaaa" "aaaaa" ...
# $ START_DATE : Date, format: "2015-01-10" "2015-01-11" "2015-01-13" "2015-01-24" ...
# $ randint : int 1 4 4 5 5 3 8 9 4 5 ...
# $ EXPIRY_DATE : Date, format: "2015-11-02" "2015-10-29" "2015-10-27" "2015-09-11" ...
# - attr(*, ".internal.selfref")=<externalptr>
# - attr(*, "sorted")= chr [1:4] "RANDOM_STRING" "START_DATE" "EXPIRY_DATE" "randint"
#
This was created and saved to an RDS file, and converted to a keyed data.table as part of the process.
haskey(testdata)
#[1] TRUE
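For reference, here is a hypothetical sketch of how a dataset with this structure could be built and keyed. The original creation code wasn't shown in the post, so the column contents (string pool, date ranges, integer range) are assumptions based on the str() output above.

```r
library(data.table)

n <- 60e6  # 60M rows as in the post; reduce this while experimenting
set.seed(42)
testdata <- data.table(
  RANDOM_STRING = sample(c("aaaaa", "bbbbb", "ccccc"), n, replace = TRUE),
  START_DATE    = as.Date("2015-01-01") + sample(0:364, n, replace = TRUE),
  randint       = sample(1:10, n, replace = TRUE),
  EXPIRY_DATE   = as.Date("2015-06-01") + sample(0:364, n, replace = TRUE)
)

# Keying sorts the table and sets the "sorted" attribute seen in str()
setkey(testdata, RANDOM_STRING, START_DATE, EXPIRY_DATE, randint)
saveRDS(testdata, "testdata.rds")
```

Keying up front matters here: unique() on a keyed data.table uses the key columns by default, so the sort cost is paid once at creation rather than at query time.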
First up, calling unique() (which dispatches to the data.table method, since testdata is a keyed data.table):
system.time(uniqdat <- unique(testdata))
Timings to return the 5,993,540 unique observations:
##############################
# user system elapsed
# 20.01 10.52 37.42
##############################
Does it make any difference if I don’t assign the result to a variable?
system.time(unique(testdata))
##############################
# user system elapsed
# 17.88 8.50 25.47
##############################
Looks like it does.
Now, in the real-life example, the analyst was seeing severe timing issues, on the order of hours rather than seconds. While researching alternative solutions, I chanced upon the collapse package.
It’s pretty amazing. After a couple of false starts because I didn’t read the documentation, I successfully called the funique() function: you have to specify the columns you want to find unique values for.
system.time(uniqdat2 <- funique(testdata, cols = 1:4))
##############################
# user system elapsed
# 4.33 1.74 8.23
##############################
Testing without assignment:
system.time(funique(testdata, cols = 1:4))
##############################
# user system elapsed
# 4.66 2.60 8.00
##############################
Pretty cool - a huge reduction in time.
From the introductory vignette:
“collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
To facilitate complex data transformation, exploration and computing tasks in R.
To help make R code fast, flexible, parsimonious and programmer friendly.”
I look forward to getting more familiar with this package.
Finally, out of interest, I tried dplyr. I had already seen that dplyr often outperforms data.table for finding unique/distinct values:
# dplyr for thoroughness
system.time(uniqdat3 <- distinct(testdata))
##############################
# user system elapsed
# 14.87 2.86 20.82
##############################
Not as fast as collapse, but faster than data.table.
I also tried without assigning to a variable, but each time I got ‘out of memory’ errors.
My personal laptop is a bit underpowered for this dataset (only 8GB RAM) and it couldn’t handle it.
So I am now in the process of upgrading my laptop (it’s easier to swap out the SSD and RAM than to buy a new one), which will double my memory, at which point I might try this again.
I’m impressed with collapse; it’s a really nice package (just look at the detail in the logo, you can tell this is the real deal) and I can’t wait to become more conversant with its functionality.
Post publishing update
I upgraded my RAM (now 16GB) and repeated the exercise:
Timings are as follows:
# user system elapsed
# 19.31 1.46 14.82 data.table (with assignment)
# 18.38 1.64 13.72 data.table (no assignment)
# 4.70 0.59 5.42 collapse (with assignment)
# 4.36 0.50 4.92 collapse (no assignment)
# 13.52 0.33 14.18 dplyr (with assignment)
# 12.65 0.53 13.45 dplyr (no assignment)
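system.time() gives a single noisy measurement, so a fairer head-to-head would repeat each call several times. Here is a sketch of how that could look using the bench package; bench::mark and its iterations/check arguments are real, but using it here (and running only 5 iterations) is my suggestion, not something from the original exercise.

```r
library(bench)
library(data.table)
library(dplyr)
library(collapse)

res <- bench::mark(
  data.table = unique(testdata),
  dplyr      = distinct(testdata),
  collapse   = funique(testdata, cols = 1:4),
  iterations = 5,
  check = FALSE  # results may differ in row order and class attributes
)
res[, c("expression", "median", "mem_alloc")]
```

check = FALSE is needed because the three functions can return the same rows in different orders (and with different classes), which would otherwise fail bench's equality check.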