I am trying to combine multiple Excel files with Python pandas. Some files have different headers from each other.

There is a similar question on Stack Overflow.

This is where it fails:

# Turn them into dataframes using pandas
frames = []
for excel in excels:
  frame = excel.parse(excel.sheet_names[0],index_col=None)
  frames.append(frame[['Charges', 'Amount','Taxes','Date','Discount Percent', 'Zipcode', 'Order Number']])

KeyError: "['Charges', 'Zipcode', 'Discount Percent'] not in index"

One Excel file might have a given header while another doesn't, and this part of the code fails. How can I make it keep going when it encounters a header that is not present, or leave that field blank?

The entire script: concat.py

import pandas as pd
import os

excel_path = "C:\\Users\\khernandez\\Desktop\\compare-and-concat\\raw\\"
# File names to join
excel_names = [os.path.join(excel_path, f) for f in os.listdir(excel_path)]

excels = []
for name in excel_names:
  print("Loading File: " + name)
  excels.append(pd.ExcelFile(name))

# Turn them into dataframes using pandas
frames = []
for excel in excels:
  print("Converting to data frame")
  print(excel)
  frame = excel.parse(excel.sheet_names[0],index_col=None)
  frames.append(frame[['Charges', 'Amount','Taxes','Date','Discount Percent', 'Zipcode', 'Order Number']])


# # Delete the first row of the excel file
# print("Removing HEADERS")
# frames[1:] = [df[1:] for df in frames[1:]]

# Combine the dataframes
print("Combining frames")
combined = pd.concat(frames)


# Write them out to a file named concated.xlsx
combined.to_excel("concated.xlsx", header=True, index=False)

1 Answer


I'm typing this blind, so it is not fully tested.

You have a fixed set of columns to extract from the source Excel files. Use an intersection to select only the columns that exist in each file, then reindex after concatenating to add back any missing columns (they are filled with NaN):

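import numpy as np

frames = []
cols = ['Charges', 'Amount','Taxes','Date','Discount Percent', 'Zipcode', 'Order Number']
for excel in excels:
    ...
    # Keep only the wanted columns that this sheet actually has
    frames.append(frame[np.intersect1d(cols, frame.columns.values)])

# Reindex along the columns (not the rows) to add back missing columns as NaN
combined = pd.concat(frames, sort=False, ignore_index=True) \
                .reindex(columns=cols)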
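
As a self-contained illustration of the intersect-then-reindex idea, here is a minimal sketch that uses two small in-memory DataFrames in place of the parsed Excel sheets. The sample data and the names frame_a and frame_b are made up for the example, and it uses Index.intersection and reindex(columns=...) as one possible way to express the two steps.

```python
import pandas as pd

cols = ['Charges', 'Amount', 'Taxes', 'Date',
        'Discount Percent', 'Zipcode', 'Order Number']

# Hypothetical stand-ins for frames produced by excel.parse(...).
# frame_a lacks 'Charges', 'Discount Percent' and 'Zipcode' and has an extra column.
frame_a = pd.DataFrame({'Amount': [10.0], 'Taxes': [0.8], 'Date': ['2019-01-01'],
                        'Order Number': [1001], 'Extra': ['x']})
frame_b = pd.DataFrame({'Charges': [2.5], 'Amount': [20.0], 'Taxes': [1.6],
                        'Date': ['2019-01-02'], 'Discount Percent': [5],
                        'Zipcode': ['90210'], 'Order Number': [1002]})

# Step 1: from each frame, keep only the wanted columns that actually exist.
frames = [frame[frame.columns.intersection(cols)] for frame in (frame_a, frame_b)]

# Step 2: concatenate, then reindex along the columns so that every wanted
# column is present, in the order given by cols; missing values become NaN.
combined = pd.concat(frames, sort=False, ignore_index=True).reindex(columns=cols)

print(combined)
```

Reindexing after the concat also covers the case where none of the files has a given column, since reindex creates it filled with NaN.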

5 Comments

I mirrored your code into mine and got AttributeError: 'Series' object has no attribute 'intersection'. I looked up the documentation and it seems to be right, so why isn't it working?
Sorry, my failure at reading the documentation. Changed to np.intersect1d.
The script runs but only writes the headers to my Excel file. Maybe something in pd.concat()?
I just removed everything but sort=False and it worked!
Glad it worked for you. I thought it was the axis value in reindex.
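
The "only headers" symptom mentioned in the comments is consistent with reindex being applied along the rows instead of the columns. The following small sketch, with made-up data, shows the difference between the two axis values; it is an illustration of the reindex behaviour, not a reconstruction of the exact script that was run.

```python
import pandas as pd

cols = ['Charges', 'Amount']
df = pd.DataFrame({'Amount': [10.0, 20.0]})  # hypothetical sample data

# axis=0: the labels in cols are looked up in the ROW index (0, 1).
# None of them match, so every row of the result is NaN.
by_rows = df.reindex(cols, axis=0)

# axis=1: the labels in cols are looked up in the COLUMN index.
# 'Amount' keeps its data and 'Charges' is added as an all-NaN column.
by_cols = df.reindex(cols, axis=1)

print(by_rows)
print(by_cols)
```

Written out with index=False, the axis=0 result would contain the header row followed by empty cells, which matches what the asker described.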
