Read multiple xlsx files with multiple sheets into one R data frame

Question

I have been reading up on how to read and combine multiple xlsx files into one R data frame and have come across some very good suggestions like, How to read multiple xlsx file in R using loop with specific rows and columns, but non fits my data set so far.

I would like R to read in multiple xlsx files with that have multiple sheets. All sheets and files have the same columns but not the same length and NA's should be excluded. I want to skip the first 3 rows and only take in columns 1:6, 8:10, 12:17, 19.

So far I tried:

file.list <- list.files(recursive=T,pattern='*.xlsx')

dat = lapply(file.list, function(i){
    x = read.xlsx(i, sheetIndex=1, sheetName=NULL, startRow=4,
              endRow=NULL, as.data.frame=TRUE, header=F)
# Column select 
    x = x[, c(1:6,8:10,12:17,19)]
# Create column with file name  
    x$file = i
# Return data
    x
  })

  dat = do.call("rbind.data.frame", dat)

But this only takes all the first sheet of every file

Does anyone know how to get all the sheets and files together in one R data frame?

Also, what packages would you recommend for large sets of data? So far I tried readxl and XLConnect.

You have explicitly asked for only the first sheet in your function: x = read.xlsx(i, sheetIndex=1,....) — mkt
– mkt, Commented Jul 5, 2016 at 7:48
Also, if you're looking to optimize speed for large datasets, it's worth looking up the data.table package. Among other things, its fread function allows you to only read in the columns that you need, instead of reading all columns and then subsetting. But I'm not sure that it will work with .xlsx files. — mkt
– mkt, Commented Jul 5, 2016 at 7:49
Your lapply has looped over files, you need to ass a second loop over sheets to get what you want. — Choubi
– Choubi, Commented Jul 5, 2016 at 7:50
Thanks for the suggestion. Do you know if there is a way to ask for all sheets with with read.xlsx? — Elisah
– Elisah, Commented Jul 5, 2016 at 9:08

Andre Elrico · Accepted Answer · 2017-11-24 13:30:58Z

7

openxlsx solution:

filename <-"myFilePath"

sheets <- openxlsx::getSheetNames(filename)
SheetList <- lapply(sheets,openxlsx::read.xlsx,xlsxFile=filename)
names(SheetList) <- sheets

edited Nov 24, 2017 at 13:30

answered Sep 28, 2017 at 14:00

Andre Elrico

11.6k8 gold badges56 silver badges78 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

SeanM Over a year ago

I think this is a faster and more elegant solution!! Thanks!

sbha · Accepted Answer · 2018-07-18 13:15:59Z

Here's a tidyverse and readxl driven option that returns a data frame with columns for file and sheet names for each file.

In this example, not every file has the same sheets or columns; test2.xlsx has only one sheet and test3.xlsx sheet1 does not have col3.

library(tidyverse)
library(readxl)

dir_path <- "~/test_dir/"         # target directory path where the xlsx files are located. 
re_file <- "^test[0-9]\\.xlsx"    # regex pattern to match the file name format, in this case 'test1.xlsx', 'test2.xlsx' etc, but could simply be 'xlsx'.

read_sheets <- function(dir_path, file){
  xlsx_file <- paste0(dir_path, file)
  xlsx_file %>%
    excel_sheets() %>%
    set_names() %>%
    map_df(read_excel, path = xlsx_file, .id = 'sheet_name') %>% 
    mutate(file_name = file) %>% 
    select(file_name, sheet_name, everything())
}

df <- list.files(dir_path, re_file) %>% 
  map_df(~ read_sheets(dir_path, .))

# A tibble: 15 x 5
   file_name  sheet_name  col1  col2  col3
   <chr>      <chr>      <dbl> <dbl> <dbl>
 1 test1.xlsx Sheet1         1     2     4
 2 test1.xlsx Sheet1         3     2     3
 3 test1.xlsx Sheet1         2     4     4
 4 test1.xlsx Sheet2         3     3     1
 5 test1.xlsx Sheet2         2     2     2
 6 test1.xlsx Sheet2         4     3     4
 7 test2.xlsx Sheet1         1     3     5
 8 test2.xlsx Sheet1         4     4     3
 9 test2.xlsx Sheet1         1     2     2
10 test3.xlsx Sheet1         3     9    NA
11 test3.xlsx Sheet1         4     7    NA
12 test3.xlsx Sheet1         5     3    NA
13 test3.xlsx Sheet2         1     3     4
14 test3.xlsx Sheet2         2     5     9
15 test3.xlsx Sheet2         4     3     1

GPierre · Accepted Answer · 2016-07-05 08:04:59Z

4

I would use a nested loop like this to go through each sheet of each file. It might not be the fastest solution but it is the simplest.

require(xlsx)    
file.list <- list.files(recursive=T,pattern='*.xlsx')  #get files list from folder

for (i in 1:length(files.list)){                                           
  wb <- loadWorkbook(files.list[i])           #select a file & load workbook
  sheet <- getSheets(wb)                      #get sheet list

  for (j in 1:length(sheet)){ 
    tmp<-read.xlsx(files.list[i], sheetIndex=j, colIndex= c(1:6,8:10,12:17,19),
                   sheetName=NULL, startRow=4, endRow=NULL,
                   as.data.frame=TRUE, header=F)   
    if (i==1&j==1) dataset<-tmp else dataset<-rbind(dataset,tmp)   #happend to previous

  }
}

You can clean NA values after the loading phase.

edited Jul 5, 2016 at 8:04

answered Jul 5, 2016 at 7:50

GPierre

9239 silver badges25 bronze badges

5 Comments

Elisah Over a year ago

Thanks very much! It's doesn't work for me quit well yet. I'm getting a warning: Error in charToDate(x) : character string is not in a standard unambiguous format

GPierre Over a year ago

This sounds like an error you would get after loading everything, when formatting your date variable. With that error, you usually have to manipulate your date variable to get it to a standard format. See this post.

GPierre Over a year ago

It could also come from R trying to guess the class of your columns and wrongly assume it is a date (see ? read.xlsx). In that case, I would suggest to use the colClasses argument to force all columns to be loaded as strings. And clean the data after everything is loaded (give correct classes, NA..) .

Roman Over a year ago

Instead of using two loops you could replace the second by a sapply function such as do.call("rbind", sapply(sheet, function(x, y) read.xlsx(y, x), files.list[i]))

coip Over a year ago

Shouldn't files.list in your for-loop by file.list instead?

rane · Accepted Answer · 2019-04-22 07:18:08Z

0

One more solution from this "rio" package :

library("rio")

# import and rbind all worksheets
DT <- import_list(SINGLE_XLSX_PATH, rbind = TRUE)

source : rdrr.io

answered Apr 22, 2019 at 7:18

rane

9314 gold badges13 silver badges27 bronze badges

Collectives™ on Stack Overflow

Read multiple xlsx files with multiple sheets into one R data frame

4 Answers 4

1 Comment

Comments

5 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related