
collapse might rock your world

collapse is a new force in the data transformation stakes. In this post, I try out one simple function and compare it to data.table and dplyr.

The NHS-R Community prompted this post, specifically a question posted to the help-with-r channel on Slack.

Someone asked about obtaining unique values in a 60M row / 6 column dataset.

Naturally, data.table came up, and a few of us made suggestions for improvement, but we didn’t seem to be hitting on a magic solution.

Without being able to get my hands on the actual data, I decided to have a play around at home.

I don’t know what the real-life data structure was, but this is the simple four-column dataset I produced initially, named testdata:

str(testdata)
# Classes 'data.table' and 'data.frame': 60000000 obs. of 4 variables:
#  $ RANDOM_STRING: chr  "aaaaa" "aaaaa" "aaaaa" "aaaaa" ...
#  $ START_DATE   : Date, format: "2015-01-10" "2015-01-11" "2015-01-13" "2015-01-24" ...
#  $ randint      : int  1 4 4 5 5 3 8 9 4 5 ...
#  $ EXPIRY_DATE  : Date, format: "2015-11-02" "2015-10-29" "2015-10-27" "2015-09-11" ...
#  - attr(*, ".internal.selfref")=<externalptr>
#  - attr(*, "sorted")= chr [1:4] "RANDOM_STRING" "START_DATE" "EXPIRY_DATE" "randint"

This was created and saved to an RDS file, and converted to a keyed data.table as part of the process.
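For anyone wanting to reproduce this at home, here is a minimal sketch of how such a dataset could be generated. The column names match the str() output above, but the string pool and value ranges are my assumptions rather than the original script:

library(data.table)

n <- 60e6  # 60 million rows

# build a pool of random five-letter strings, then sample rows from it
pool <- replicate(1e5, paste(sample(letters, 5, replace = TRUE), collapse = ""))

testdata <- data.table(
  RANDOM_STRING = sample(pool, n, replace = TRUE),
  START_DATE    = as.Date("2015-01-01") + sample.int(365, n, replace = TRUE),
  randint       = sample.int(10, n, replace = TRUE),
  EXPIRY_DATE   = as.Date("2015-06-01") + sample.int(365, n, replace = TRUE)
)

# keying sorts the table and sets the "sorted" attribute seen in str()
setkey(testdata, RANDOM_STRING, START_DATE, EXPIRY_DATE, randint)
saveRDS(testdata, "testdata.rds")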

haskey(testdata)
# [1] TRUE

First up, calling data.table::unique

system.time(uniqdat <- unique(testdata)) 

Timings to return 5,993,540 unique observations:

##############################
#    user  system elapsed
#   20.01   10.52   37.42
##############################

Does it make any difference if I don’t assign to a variable?

system.time(unique(testdata))
##############################
#    user  system elapsed
#   17.88    8.50   25.47
##############################

Looks like it does.
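That said, a single system.time() call can be swayed by garbage collection, so the with/without-assignment gap deserves repeated runs before drawing firm conclusions. A sketch using the bench package (my addition here; not what produced the timings above):

library(bench)

# run each expression a few times and report the median; check = FALSE
# because we are timing, not comparing return values
res <- bench::mark(
  with_assignment = { uniqdat <- unique(testdata); NULL },
  no_assignment   = { unique(testdata); NULL },
  iterations = 3,
  check = FALSE
)
res[, c("expression", "median", "mem_alloc")]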

Now, in the real-life example, the analyst was seeing severe timing issues: hours rather than seconds. While researching alternative solutions, I chanced upon the collapse package:

It’s pretty amazing. After a couple of false starts because I didn’t read the documentation, I successfully called the function: you have to specify the columns you want to find unique values for.
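One step the snippets below gloss over: funique() lives in collapse, so the package needs loading first.

# install.packages("collapse")  # if not already installed
library(collapse)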

system.time(uniqdat2 <- funique(testdata, cols = 1:4))
##############################
#    user  system elapsed
#    4.33    1.74    8.23
##############################

Testing without assignment:

system.time(funique(testdata, cols = 1:4))
##############################
#    user  system elapsed
#    4.66    2.60    8.00
##############################

Pretty cool - a huge reduction in time.
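Before celebrating, it’s worth confirming both methods return the same rows. The two results can differ in row order and attributes (my table was keyed, while funique() by default returns rows in order of first appearance), so the check below compares sorted copies. This is my own sanity check, not part of the timings above:

# same number of unique rows from both methods?
nrow(uniqdat) == nrow(uniqdat2)

# same content? sort copies on all columns first, and ignore
# attributes such as keys, which the two results won't share
isTRUE(all.equal(setorder(copy(uniqdat)),
                 setorder(copy(uniqdat2)),
                 check.attributes = FALSE))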

From the introductory vignette:

“collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

To facilitate complex data transformation, exploration and computing tasks in R.

To help make R code fast, flexible, parsimonious and programmer friendly.”

I look forward to getting more familiar with this package.

Finally, out of interest, I tried dplyr timings. I had already seen that dplyr often outperforms data.table for finding unique / distinct values:

# dplyr for thoroughness
system.time(uniqdat3 <- distinct(testdata))
##############################
#    user  system elapsed
#   14.87    2.86   20.82
##############################

Not as fast as collapse, but faster than data.table.

I also tried without assigning to a variable, but each time I got ‘out of memory’ errors.

My personal laptop is a bit underpowered for this dataset (only 8GB RAM) and it couldn’t handle it.
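For a rough sense of why: the 60M x 4 table alone is well over a gigabyte in memory, and unique()/distinct() need working copies and grouping structures on top of that. A quick check (assuming testdata as built above):

# the base table is already a sizeable chunk of an 8GB machine
print(object.size(testdata), units = "Gb")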

So now I am in the process of upgrading my laptop (it’s easier to swap out the SSD and RAM than to buy a new one), which will double my RAM, at which point I might try this again.

I’m impressed with collapse; it’s a really nice package (just look at the detail in the logo, you can tell this is the real deal) and I can’t wait to become more conversant with its functionality.

Post-publication update

I upgraded my RAM (now 16GB) and repeated the exercise:

Timings are as follows:

#    user  system elapsed
#   19.31    1.46   14.82   data.table (with assignment)
#   18.38    1.64   13.72   data.table (no assignment)
#    4.70    0.59    5.42   collapse (with assignment)
#    4.36    0.50    4.92   collapse (no assignment)
#   13.52    0.33   14.18   dplyr (with assignment)
#   12.65    0.53   13.45   dplyr (no assignment)
