Well Well Well my Excel

Well Well Well my Excel

I’m fed up forgetting how to combine lots of excel workbooks into one dataframe, so I’m documenting several ways of doing it, using purrr, data.table, rio and base R.

Importing multiple excel files (not sheets, but files) and flattening into one data frame / tibble / data.table should be straightforward - shouldn’t it? How many ways can there be?

What are the pros and cons of each?

I’ve made three copies of an excel file from a PreppinData challenge.

For this, I just want to import the first sheet from each workbook, into one table. Ideally I’d like to identify the sheets - but it’s not a deal breaker.

suppressPackageStartupMessages(library(rio))
suppressPackageStartupMessages(library(readxl))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(data.table))


files <- dir(pattern = "*.xlsx")  # in case anything else sneaks into folder

base….how low can you go?

As above, no packages, what can we do with base R?

method_lapply <- lapply(files, read_excel)
method_lapply <- do.call(rbind, Map(data.frame, method_lapply))
head(method_lapply)
##         Date New...Saddles New...Mudguards New...Wheels New...Bags
## 1 2021-01-21            13              42           19         38
## 2 2021-02-21             1               9           14          6
## 3 2021-03-21             8              22            6         35
## 4 2021-04-21             3               9            8         16
## 5 2021-05-21             2               8            5         34
## 6 2021-06-21            11               2            6          8
##   Existing...Saddles Existing...Mudguards Existing...Wheels Existing...Bags
## 1                 17                   48                19              13
## 2                  2                    4                19              24
## 3                  0                   48                17              16
## 4                 18                   50                18              25
## 5                 17                    3                12              19
## 6                  2                    8                 3               1

Is it a data.frame? Yes.
Is it one I’d want to work with? No - those column names are horrible.

Purrr

method_purrr <- map_dfr(files, read_excel, sheet = "Manchester")
str(method_purrr)
## tibble [36 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Date                : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num [1:36] 13 1 8 3 2 11 16 10 15 9 ...
##  $ New - Mudguards     : num [1:36] 42 9 22 9 8 2 5 7 25 11 ...
##  $ New - Wheels        : num [1:36] 19 14 6 8 5 6 15 18 1 11 ...
##  $ New - Bags          : num [1:36] 38 6 35 16 34 8 37 27 38 0 ...
##  $ Existing - Saddles  : num [1:36] 17 2 0 18 17 2 19 10 18 18 ...
##  $ Existing - Mudguards: num [1:36] 48 4 48 50 3 8 1 4 9 10 ...
##  $ Existing - Wheels   : num [1:36] 19 19 17 18 12 3 7 8 0 17 ...
##  $ Existing - Bags     : num [1:36] 13 24 16 25 19 1 28 9 23 7 ...

Let’s verify the sheet argument is working by pulling in the London data instead:

method_purrr2 <- map_dfr(files, read_excel, sheet = "London")
str(method_purrr2)
## tibble [36 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Date                : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num [1:36] 3 19 16 6 5 13 10 16 1 9 ...
##  $ New - Mudguards     : num [1:36] 44 32 52 56 3 6 8 9 32 8 ...
##  $ New - Wheels        : num [1:36] 18 2 2 1 6 16 7 10 2 17 ...
##  $ New - Bags          : num [1:36] 30 38 1 14 36 22 25 2 23 23 ...
##  $ Existing - Saddles  : num [1:36] 5 3 17 18 16 12 20 20 14 18 ...
##  $ Existing - Mudguards: num [1:36] 35 7 0 15 0 4 9 0 36 15 ...
##  $ Existing - Wheels   : num [1:36] 19 17 18 1 1 18 3 7 17 8 ...
##  $ Existing - Bags     : num [1:36] 19 14 14 9 26 20 12 6 23 36 ...

If I want to identify the source workbook, how do I do that? Turns out I need to specify the .id argument, and I need to give it a name wrapped in quotes.

This fails, because I didn’t wrap source_wb ( for source workbook) in quotes :

method_purrr2 <- map_dfr(files, read_excel, sheet = "London", .id = source_wb)

The .id argument needs to be quoted. This now works, and it returns a character vector to identify the source - although these are not the actual names of the workbooks, so still not ideal.

method_purrr2 <- map_dfr(files, read_excel, sheet = "London", .id = "source_wb")
method_purrr2 %>% select(source_wb) %>% distinct()
## # A tibble: 3 x 1
##   source_wb
##   <chr>    
## 1 1        
## 2 2        
## 3 3

This shows I have 3 unique source workbook values, and that the structure is as expected.

str(method_purrr2)
## tibble [36 x 10] (S3: tbl_df/tbl/data.frame)
##  $ source_wb           : chr [1:36] "1" "1" "1" "1" ...
##  $ Date                : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num [1:36] 3 19 16 6 5 13 10 16 1 9 ...
##  $ New - Mudguards     : num [1:36] 44 32 52 56 3 6 8 9 32 8 ...
##  $ New - Wheels        : num [1:36] 18 2 2 1 6 16 7 10 2 17 ...
##  $ New - Bags          : num [1:36] 30 38 1 14 36 22 25 2 23 23 ...
##  $ Existing - Saddles  : num [1:36] 5 3 17 18 16 12 20 20 14 18 ...
##  $ Existing - Mudguards: num [1:36] 35 7 0 15 0 4 9 0 36 15 ...
##  $ Existing - Wheels   : num [1:36] 19 17 18 1 1 18 3 7 17 8 ...
##  $ Existing - Bags     : num [1:36] 19 14 14 9 26 20 12 6 23 36 ...

How do I get the actual workbook names? I need to pipe the files vector to set_names and then onto map_dfr

method_purrr3 <- files %>% 
set_names() %>% 
map_dfr(read_excel, sheet = "Manchester",.id = "source_wb")
str(method_purrr3)
## tibble [36 x 10] (S3: tbl_df/tbl/data.frame)
##  $ source_wb           : chr [1:36] "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ...
##  $ Date                : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num [1:36] 13 1 8 3 2 11 16 10 15 9 ...
##  $ New - Mudguards     : num [1:36] 42 9 22 9 8 2 5 7 25 11 ...
##  $ New - Wheels        : num [1:36] 19 14 6 8 5 6 15 18 1 11 ...
##  $ New - Bags          : num [1:36] 38 6 35 16 34 8 37 27 38 0 ...
##  $ Existing - Saddles  : num [1:36] 17 2 0 18 17 2 19 10 18 18 ...
##  $ Existing - Mudguards: num [1:36] 48 4 48 50 3 8 1 4 9 10 ...
##  $ Existing - Wheels   : num [1:36] 19 19 17 18 12 3 7 8 0 17 ...
##  $ Existing - Bags     : num [1:36] 13 24 16 25 19 1 28 9 23 7 ...

And let’s verify that those (admittedly horrible) source workbook names are coming through correctly

method_purrr3 %>% select(source_wb) %>% distinct()
## # A tibble: 3 x 1
##   source_wb              
##   <chr>                  
## 1 copy2.xlsx             
## 2 copy3.xlsx             
## 3 PD 2021 Wk 4 Input.xlsx

for loop and data.table’s rbindlist()

filecount <- as.numeric(length(files))  
temp_list <- list()

for (i in seq_along(files)) {
  filename <- files[i] 
  df <- read_excel(path = filename, sheet = "Manchester")
  df$source_wb <- filename
  temp_list[[i]] <- df
  rm(df)
 method_datatable <- data.table::rbindlist(temp_list, fill = TRUE) 
} 
rm(temp_list)
str(method_datatable)
## Classes 'data.table' and 'data.frame':   36 obs. of  10 variables:
##  $ Date                : POSIXct, format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num  13 1 8 3 2 11 16 10 15 9 ...
##  $ New - Mudguards     : num  42 9 22 9 8 2 5 7 25 11 ...
##  $ New - Wheels        : num  19 14 6 8 5 6 15 18 1 11 ...
##  $ New - Bags          : num  38 6 35 16 34 8 37 27 38 0 ...
##  $ Existing - Saddles  : num  17 2 0 18 17 2 19 10 18 18 ...
##  $ Existing - Mudguards: num  48 4 48 50 3 8 1 4 9 10 ...
##  $ Existing - Wheels   : num  19 19 17 18 12 3 7 8 0 17 ...
##  $ Existing - Bags     : num  13 24 16 25 19 1 28 9 23 7 ...
##  $ source_wb           : chr  "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Ugly, but effective. This is fast, as you’d expect, and we get the source workbook name along with the data.

rio

Not only is rio the misspelled name of my favourite Japanese, Liverpool supporting rstats football guru, but it also excels at - well, importing excel.
Check it out - the source file name comes through with no additional code required from us.

method_rio <-  import_list(files, rbind = TRUE, sheet = "Manchester")
str(method_rio)
## 'data.frame':    36 obs. of  10 variables:
##  $ Date                : POSIXct, format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num  13 1 8 3 2 11 16 10 15 9 ...
##  $ New - Mudguards     : num  42 9 22 9 8 2 5 7 25 11 ...
##  $ New - Wheels        : num  19 14 6 8 5 6 15 18 1 11 ...
##  $ New - Bags          : num  38 6 35 16 34 8 37 27 38 0 ...
##  $ Existing - Saddles  : num  17 2 0 18 17 2 19 10 18 18 ...
##  $ Existing - Mudguards: num  48 4 48 50 3 8 1 4 9 10 ...
##  $ Existing - Wheels   : num  19 19 17 18 12 3 7 8 0 17 ...
##  $ Existing - Bags     : num  13 24 16 25 19 1 28 9 23 7 ...
##  $ _file               : chr  "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Yes, it works, it’s quick, it identifies the source, and it’s ridiculously easy. We can also specify a name for the identity column:

method_rio2 <-  import_list(files, rbind = TRUE, rbind_label = "source_wb", sheet = "London")
str(method_rio2)
## 'data.frame':    36 obs. of  10 variables:
##  $ Date                : POSIXct, format: "2021-01-21" "2021-02-21" ...
##  $ New - Saddles       : num  3 19 16 6 5 13 10 16 1 9 ...
##  $ New - Mudguards     : num  44 32 52 56 3 6 8 9 32 8 ...
##  $ New - Wheels        : num  18 2 2 1 6 16 7 10 2 17 ...
##  $ New - Bags          : num  30 38 1 14 36 22 25 2 23 23 ...
##  $ Existing - Saddles  : num  5 3 17 18 16 12 20 20 14 18 ...
##  $ Existing - Mudguards: num  35 7 0 15 0 4 9 0 36 15 ...
##  $ Existing - Wheels   : num  19 17 18 1 1 18 3 7 17 8 ...
##  $ Existing - Bags     : num  19 14 14 9 26 20 12 6 23 36 ...
##  $ source_wb           : chr  "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ...
##  - attr(*, ".internal.selfref")=<externalptr>

So there you are, several methods of importing multiple excel files, take your pick, I have to say, the rio function is rather sweet and might be the one I use in future.

purrrr’s map_dfr is also pretty cool, once you realise that the additional arguments to the read_excel function don’t go inside the brackets of the call to read_excel.

Hope this has been useful / helpful to some of you. If so, let me know.

The source files are here if you want to try this yourself


© 2016 - 2021. All rights reserved.