I can combine two CSV files with the script below, and it works well.

import pandas

csv1=pandas.read_csv('1.csv')
csv2=pandas.read_csv('2.csv')
merged=csv1.merge(csv2,on='field1')
merged.to_csv('output.csv',index=False)

Now I would like to combine more than two CSV files using the same method. I have a list of CSV files, defined like this:

import pandas
collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
  csv=pandas.read_csv(i)
  merged=csv.merge(??,on='field1')
  merged.to_csv('output2.csv',index=False)

I haven't got it to work so far with more than one CSV. I guess it is just a matter of iterating over the list. Any ideas?

3 Comments
  • Are you using merge for a SQL-style inner join? Or could you possibly use concat instead? Commented Apr 10, 2015 at 13:46
  • Too bad, I don't have access to the SQL DB. The data is only given as CSV, unfortunately :( Commented Apr 10, 2015 at 14:10
  • I see. I'm just trying to understand the type of join you are doing; if it's an inner join, then sticking with merge is good, but if you can use concat, the code would be a lot simpler. Commented Apr 10, 2015 at 14:40
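
To make the merge-versus-concat distinction in these comments concrete, here is a minimal sketch. The two small frames and the column names `a` and `b` are made up for illustration; only `field1` comes from the question.

```python
import pandas

# Hypothetical sample data; only the 'field1' key name is from the question.
left = pandas.DataFrame({'field1': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pandas.DataFrame({'field1': [2, 3, 4], 'b': [20, 30, 40]})

# merge: SQL-style inner join, keeps only keys present in both frames
joined = left.merge(right, on='field1')

# concat: stacks the rows of both frames; columns missing from one
# frame are filled with NaN
stacked = pandas.concat([left, right], ignore_index=True)
```

If the files share the same columns and the rows should simply be appended, concat is likely enough; if rows have to be matched on `field1`, merge is the right tool.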

1 Answer

You need special handling for the first loop iteration:

import pandas
collection=['1.csv','2.csv','3.csv','4.csv']

result = None
for i in collection:
  csv=pandas.read_csv(i)
  if result is None:
    result = csv
  else:
    result = result.merge(csv, on='field1')

# compare against None; `if result:` raises ValueError on a DataFrame
if result is not None:
  result.to_csv('output2.csv',index=False)
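
A note on the final check: pandas does not let a DataFrame be used directly in a boolean context, so the "was anything loaded?" test has to be explicit. The sketch below shows the checks that do work; `has_rows` is a hypothetical helper name, not part of pandas.

```python
import pandas

def has_rows(frame):
    """Return True if frame is a DataFrame with at least one row."""
    return frame is not None and not frame.empty

frame = pandas.DataFrame({'field1': [1, 2]})

try:
    bool(frame)          # this is what `if frame:` evaluates
    ambiguous = False
except ValueError:
    ambiguous = True     # pandas refuses to pick a truth value
```

Use `frame is not None` when the question is "did the loop run at all?", and `frame.empty` when the question is "did the merge leave any rows?".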

An alternative is to load the first CSV outside the loop, but this raises an IndexError when the collection is empty:

import pandas
collection=['1.csv','2.csv','3.csv','4.csv']

result = pandas.read_csv(collection[0])
for i in collection[1:]:
  csv = pandas.read_csv(i)
  result = result.merge(csv, on='field1')

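# result is always a DataFrame here, so no check is needed
# (`if result:` would raise ValueError on a DataFrame)
result.to_csv('output2.csv',index=False)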

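I don't know how to create an empty DataFrame in pandas, but if there were a neutral starting value for merge, this would work, too (`pandas.create_empty()` is a placeholder, not a real function):

import pandas
collection=['1.csv','2.csv','3.csv','4.csv']

result = pandas.create_empty() # placeholder, not a real pandas function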
for i in collection:
  csv = pandas.read_csv(i)
  result = result.merge(csv, on='field1')

result.to_csv('output2.csv',index=False)
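
The accumulate-in-a-loop pattern in the samples above can also be written as a fold with `functools.reduce`, which takes its starting value from the first element and so needs no special case for the first file. This is only a sketch: `merge_all` is a made-up helper name, and the in-memory CSV text stands in for the real files so the example runs on its own.

```python
import functools
import io
import pandas

def merge_all(frames, key='field1'):
    """Inner-join every DataFrame in frames on key, left to right."""
    frames = list(frames)
    if not frames:
        return None  # nothing to merge
    return functools.reduce(
        lambda left, right: left.merge(right, on=key), frames)

# Hypothetical stand-ins for '1.csv', '2.csv', '3.csv'
sources = [
    io.StringIO('field1,a\n1,x\n2,y\n'),
    io.StringIO('field1,b\n2,10\n3,20\n'),
    io.StringIO('field1,c\n2,k\n'),
]
merged = merge_all(pandas.read_csv(src) for src in sources)
```

With real files, the call would be `merge_all(pandas.read_csv(name) for name in collection)`, followed by `merged.to_csv('output2.csv', index=False)` once `merged is not None` has been checked.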

3 Comments

Thanks, I think it's getting closer. I am now getting this error: "Traceback (most recent call last): File "merge.py", line 11, in <module> if result.all(): File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 709, in nonzero .format(self.__class__.__name__)) ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
I don't know enough about pandas to help you there. Ask a new question and include the data record(s) that cause the error and your code.
I just removed the if result: statement from the second sample, and it works. Apparently this is the ValueError described in the pandas gotchas: pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html Thanks, Aaron.
