
I have a master csv file in the form

col1, col2, col3, col4...
a,    x,    y,    z
a,    x,    y,    z
b,    x,    y,    z
b,    x,    y,    z
..    ..    ..    ..

and I want to read this file in, then create a new Excel file with all rows where col1 == a and another file with all rows where col1 == b. So OutputFilea will look like:

col1, col2, col3, col4...
a,    x,    y,    z
a,    x,    y,    z

and OutputFileb will look like:

col1, col2, col3, col4...
b,    x,    y,    z
b,    x,    y,    z

My question is: should I use csv.reader() line by line with conditionals to decide which file each row gets appended to, or should I accumulate the rows in strings and write each file at the end? Or is there a module that optimizes a process like this?
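For reference, here is a minimal sketch of the first approach I have in mind (csv.reader() line by line with conditionals), assuming plain .csv output (which Excel can open) and that the only values in col1 are a and b:

import csv

with open('master.csv', newline='') as master, \
     open('OutputFilea.csv', 'w', newline='') as out_a, \
     open('OutputFileb.csv', 'w', newline='') as out_b:
    reader = csv.reader(master, skipinitialspace=True)
    writer_a = csv.writer(out_a)
    writer_b = csv.writer(out_b)
    header = next(reader)        # copy the header row into both files
    writer_a.writerow(header)
    writer_b.writerow(header)
    for row in reader:
        if row[0] == 'a':
            writer_a.writerow(row)
        elif row[0] == 'b':
            writer_b.writerow(row)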

  • What are your criteria for which approach is best? It sounds like all of them are reasonable approaches, making this a matter purely of opinion. Commented Jul 11, 2017 at 18:30
  • That, and the fact that you haven't actually attempted to implement any of the approaches enough to run into any concrete problems... Commented Jul 11, 2017 at 18:31
  • @MadPhysicist I will be implementing this on a large data set and do not know if these methods will be too slow or memory inefficient when that time comes. Commented Jul 11, 2017 at 18:33
  • The implementations are nearly trivial. You can try them all out before the time comes with very little effort. If you have enormous data sets, it should be apparent that holding everything in memory and writing out at the end is not a good option. Commented Jul 11, 2017 at 18:34
  • I will write up an answer with some optimizations for you. Commented Jul 11, 2017 at 18:35

2 Answers


Since you are going to be working with large data sets, it is probably best not to hold too much in memory at once. You can maintain a dictionary of open files keyed by the line prefix, and make sure the files are closed properly using a contextlib.ExitStack. This lets you open new output files lazily as you process the input file:

from contextlib import ExitStack

output_files = {}
with open('master.csv', 'r') as master, ExitStack() as output_stack:
    for line in master:
        # The first comma-separated field decides which output file gets the row.
        prefix = line.split(',', 1)[0]
        if prefix not in output_files:
            # First time this prefix appears: open its output file lazily and
            # register it with the ExitStack so it is closed automatically.
            output_name = 'output' + prefix + '.csv'
            output = output_stack.enter_context(open(output_name, 'w'))
            output_files[prefix] = output
        else:
            output = output_files[prefix]
        # 'line' already ends with a newline, so suppress print's own.
        print(line, end='', file=output)

Given that you want to copy the lines as-is into the output files, I have chosen not to use the csv module. If you want to apply more complex processing, you should of course consider using it.
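In case you do need the csv module (for example to transform fields on the way through), the same lazy-open pattern translates directly; a minimal sketch, assuming the same file names as above:

import csv
from contextlib import ExitStack

writers = {}
with open('master.csv', newline='') as master, ExitStack() as output_stack:
    for row in csv.reader(master, skipinitialspace=True):
        prefix = row[0]
        if prefix not in writers:
            output = output_stack.enter_context(
                open('output' + prefix + '.csv', 'w', newline=''))
            writers[prefix] = csv.writer(output)
        writers[prefix].writerow(row)  # any per-row processing goes here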



I would suggest trying pandas for this kind of task. It has a dedicated function for writing to Excel. Imagine I read your .csv file into a pandas DataFrame df:

In [4]: df = pd.read_csv('yourfile.csv')

In [5]: df
Out[5]: 
  col1   col2   col3   col4
0    a      x      y      z
1    a      x      y      z
2    b      x      y      z
3    b      x      y      z

Then I can select only the rows I want and save them to Excel:

In [6]: dfa = df[df['col1']=='a']

In [7]: dfa
Out[7]: 
  col1   col2   col3   col4
0    a      x      y      z
1    a      x      y      z

In [8]: dfa.to_excel('OutputFilea.xls')

The same goes for the second filter:

In [9]: dfb = df[df['col1']=='b']

In [10]: dfb.to_excel('OutputFileb.xls')
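
If col1 can take more values than just a and b, you can also loop over a groupby instead of filtering each value by hand; a minimal sketch with the same df (the file name pattern is just illustrative):

In [11]: for value, group in df.groupby('col1'):
    ...:     group.to_excel('OutputFile{}.xls'.format(value))
    ...: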

Hope that helps.

