Python: Read in Multiple Excel Workbooks into one DataFrame

Question

I have approximately 150 different workbooks (xlsx) in a folder that I would like to read into a python dataframe for analysis.

Each workbook is set up identically with the same sheet names and column names.

I need to read the first sheet of each workbook ("Keywords Rankings") into a single DataFrame. For the first workbook, I want to start at row 11 so that the column headers are kept; for every workbook after that, I want to append the data starting at row 12.

I am new to Python and have been reading some instructions online but am stuck. From my understanding, I could use the xlrd library to facilitate this.

I've been playing around with the below code but haven't gotten far. 'Keywords Rankings' is the sheet name I want to append.

import pandas as pd
import numpy as np
import glob as glob

all_data = pd.DataFrame()
all_data = pd.ExcelFile("C:\\Users\\John Smith\\Documents\\Analysis\\FPR Nov - Mar 2018\\Dec_1_General.xlsx")
print(all_data.sheet_names)
all_d = all_data.parse('Keywords Rankings')

for f in glob.glob("Users\\John Smith\\Documents\\Analysis\\FPR Nov - Mar 2018\\*.xlsx", recursive=True):
    df = pd.read_excel(f)
    all_d = all_d.append(df,ignore_index=True)

jpp · Accepted Answer · 2018-04-16 17:01:51Z


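You should not repeatedly append to an existing pd.DataFrame in a loop: each append copies the entire frame, so the total cost grows quadratically with the number of files. DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0.

Instead, build a list of DataFrames and pass it to pandas.concat once.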

This can be facilitated by a list comprehension:

df = pd.concat([pd.read_excel(f, skiprows=range(10)) for f in files], axis=0)

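Here files is a list of full paths to the workbooks. skiprows=range(10) skips the first 10 rows, so row 11 is read as the header of each workbook. Columns will then align automatically, provided the headers are in row 11 of every worksheet.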
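Put together, the approach might look like the sketch below. It assumes files is a list of full paths, that the headers sit in row 11 of every workbook, and that the sheet is named 'Keywords Rankings' as in the question. combine_workbooks and its reader parameter are illustrative names, not part of pandas, and ignore_index=True is an addition to the one-liner above so that the combined rows are renumbered from 0.

```python
import pandas as pd


def combine_workbooks(files, sheet_name="Keywords Rankings", reader=pd.read_excel):
    """Read the same sheet from every workbook and stack the rows.

    Hypothetical helper: `reader` defaults to pandas.read_excel and is only
    a parameter so the function can be exercised without real Excel files.
    """
    frames = [
        # skiprows=range(10) drops rows 1-10, so row 11 becomes the header.
        reader(f, sheet_name=sheet_name, skiprows=range(10))
        for f in files
    ]
    if not frames:
        # pd.concat raises ValueError on an empty list; fail with a clearer message.
        raise ValueError("no workbooks to read; check that `files` is not empty")
    # A single concat copies the data once; ignore_index renumbers rows 0..n-1.
    return pd.concat(frames, axis=0, ignore_index=True)
```

Because every frame is read with its own header row, there is no need to treat the first workbook differently from the rest.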


4 Comments

Dys_Lexi_A Over a year ago

When I create a variable for files {files = "Documents\Analysis\FPR Nov - Mar 2018*"}, I get the error "FileNotFoundError: [Errno 2] No such file or directory: 'D'". I have checked and my current directory is correct. Am I supposed to input something different for the files variable?

jpp Over a year ago

files should be a list of full paths to your files. You have only included folder names. So, yes, you should look up how to retrieve full paths.
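One way to build such a list, as a minimal sketch using only the standard library: glob.glob expands a wildcard pattern into a list of matching paths. The helper name list_workbooks is an illustrative assumption, and the folder in the usage comment is the one from the question. Note that a plain string is not a list of paths: looping over it yields one character at a time, which is why the error above mentions a file called 'D'.

```python
import glob
import os


def list_workbooks(folder):
    """Return sorted full paths of the .xlsx files directly inside `folder`.

    `list_workbooks` is a hypothetical helper name, not part of pandas.
    """
    pattern = os.path.join(folder, "*.xlsx")
    # glob.glob expands the wildcard and returns a list of matching paths.
    paths = glob.glob(pattern)
    # Excel creates "~$name.xlsx" lock files while a workbook is open; skip them.
    paths = [p for p in paths if not os.path.basename(p).startswith("~$")]
    return sorted(os.path.abspath(p) for p in paths)


# Example call (the folder is taken from the question; adjust it to your machine):
# files = list_workbooks(r"C:\Users\John Smith\Documents\Analysis\FPR Nov - Mar 2018")
```

If the function returns an empty list, the folder path is most likely wrong or relative to a different working directory.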

Dys_Lexi_A Over a year ago

I've tried with both the full path name starting from C: drive and the partial path name. I have been using "\*" at the end to indicate that I want all files within the final folder. Is that the correct notation?

jpp Over a year ago

I don't know. There are many questions on SO on how to extract filenames using standard libraries, I suggest you look them up.
