# collapse might rock your world

collapse is a new force in the data-transformation stakes. In this post, I try out one simple function and compare it to data.table and dplyr.

This post was prompted by the NHS-R Community, specifically a question in the help-with-r channel on Slack.

Someone asked about obtaining unique values in a 60M row / 6 column dataset.

Naturally, data.table came up, and a few of us made suggestions for improvement, but we didn’t seem to be hitting on a magic solution.

Without being able to get my hands on the actual data, I decided to have a play around at home.

I don’t know what the real-life data structure was, but this is the simple four-column dataset I produced initially, named testdata:

str(testdata)
# Classes 'data.table' and 'data.frame':	60000000 obs. of  4 variables:
#  $ RANDOM_STRING: chr  "aaaaa" "aaaaa" "aaaaa" "aaaaa" ...
#  $ START_DATE   : Date, format: "2015-01-10" "2015-01-11" "2015-01-13" "2015-01-24" ...
#  $ randint      : int  1 4 4 5 5 3 8 9 4 5 ...
#  $ EXPIRY_DATE  : Date, format: "2015-11-02" "2015-10-29" "2015-10-27" "2015-09-11" ...
#  - attr(*, ".internal.selfref")=<externalptr>
#  - attr(*, "sorted")= chr [1:4] "RANDOM_STRING" "START_DATE" "EXPIRY_DATE" "randint"


This was created and saved to an RDS file, and converted to a keyed data.table as part of the process.
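For reference, a similar keyed table could be built along these lines. This is a sketch, not the original generation code: the column contents and value ranges are my guesses, and n is scaled down here so it runs quickly (the real table had 60 million rows).

```r
# Sketch only: the value ranges and generation logic are assumptions,
# not the original code. n is scaled down from the 60 million rows
# used in the post.
library(data.table)

set.seed(42)
n <- 1e5

testdata <- data.table(
  RANDOM_STRING = sample(c("aaaaa", "bbbbb", "ccccc", "ddddd"), n, replace = TRUE),
  START_DATE    = as.Date("2015-01-01") + sample(0:180, n, replace = TRUE),
  randint       = sample(1:10, n, replace = TRUE),
  EXPIRY_DATE   = as.Date("2015-09-01") + sample(0:90, n, replace = TRUE)
)

# Keying sorts the table and sets the "sorted" attribute shown by str()
setkey(testdata, RANDOM_STRING, START_DATE, EXPIRY_DATE, randint)

saveRDS(testdata, "testdata.rds")
haskey(testdata)  # TRUE
```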

haskey(testdata)
#[1] TRUE


First up, calling data.table’s unique() method:

system.time(uniqdat <- unique(testdata))


Timings to return 5,993,540 unique observations:

##############################
# user    system elapsed
# 20.01   10.52   37.42
##############################



Does it make any difference if I don’t assign the result to a variable?

system.time(unique(testdata))

##############################
#  user  system elapsed
# 17.88    8.50   25.47
##############################



Looks like it does.

Now, in the real-life example, the analyst was seeing severe timing issues: hours rather than seconds. While researching alternative solutions, I chanced upon the collapse package.

It’s pretty amazing. After a couple of false starts because I hadn’t read the documentation, I successfully called the funique function - you have to specify the columns you want to find unique values for.

system.time(uniqdat2 <- funique(testdata, cols = 1:4))

##############################
# user  system elapsed
# 4.33    1.74    8.23
##############################



Testing without assignment:

system.time(funique(testdata, cols = 1:4))
##############################
# user    system  elapsed
# 4.66    2.60    8.00
##############################


Pretty cool - a huge reduction in time.

From the introductory vignette:

“collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

- to facilitate complex data transformation, exploration and computing tasks in R;
- to help make R code fast, flexible, parsimonious and programmer friendly.”

I look forward to getting more familiar with this package.

Finally, out of interest, I timed dplyr as well. I had already seen that dplyr often outperforms data.table for finding unique / distinct values:

# dplyr for thoroughness

system.time(uniqdat3 <- distinct(testdata))
##############################
# user    system  elapsed

# 14.87    2.86   20.82
##############################



Not as fast as collapse, but faster than data.table.
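As a quick sanity check (my addition, run on toy data rather than the 60M-row table), it’s worth confirming that the three approaches agree on what counts as a unique row:

```r
# Small-scale check that data.table, collapse and dplyr agree on the
# number of unique rows. Toy data; not the benchmark dataset from the post.
library(data.table)
library(collapse)
library(dplyr)

df <- data.frame(
  a = c(1, 1, 2, 2, 3),
  b = c("x", "x", "y", "z", "z")
)

n1 <- nrow(unique(as.data.table(df)))  # data.table
n2 <- nrow(funique(df))                # collapse (all columns by default)
n3 <- nrow(distinct(df))               # dplyr

c(n1, n2, n3)  # all 4: the (1, "x") row is duplicated, the rest are distinct
```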

I also tried without assigning to a variable, but each time I got ‘out of memory’ errors.

My personal laptop is a bit underpowered for this dataset (only 8GB of RAM) and it couldn’t handle it.

So now I am in the process of upgrading my laptop (it’s easier to swap out the SSD and RAM than to get a new one), which will give me double the RAM, at which point I might try this again.

I’m impressed with collapse. It’s a really nice package (just look at the detail in the logo: you can tell this is the real deal), and I can’t wait to become more conversant with its functionality.

## Post publishing update

I upgraded my RAM (now 16GB) and repeated the exercise:

Timings are as follows:

| Method     | Assignment | user  | system | elapsed |
|------------|------------|-------|--------|---------|
| data.table | yes        | 19.31 | 1.46   | 14.82   |
| data.table | no         | 18.38 | 1.64   | 13.72   |
| collapse   | yes        | 4.70  | 0.59   | 5.42    |
| collapse   | no         | 4.36  | 0.50   | 4.92    |
| dplyr      | yes        | 13.52 | 0.33   | 14.18   |
| dplyr      | no         | 12.65 | 0.53   | 13.45   |