8

I have a fairly large CSV file that looks like this:

+---------+---------+
| Column1 | Column2 |
+---------+---------+
|       1 |   93644 |
|       2 |   63246 |
|       3 |   47790 |
|       3 |   39644 |
|       3 |   32585 |
|       1 |   19593 |
|       1 |   12707 |
|       2 |   53480 |
+---------+---------+

My intent is to

  1. Add a new column
  2. Insert a specific value into that column, 'NewColumnValue', on each row of the csv
  3. Sort the file based on the value in Column1
  4. Split the original CSV into new files based on the contents of 'Column1', removing the header

For example, I want to end up with multiple files that look like:

+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+----------------+

+---+-------+----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+----------------+

I have managed to do this using separate .py files:

Step 1

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)

Step 2

import csv
from itertools import groupby

# itertools.groupby only groups consecutive rows, which is why the
# file was sorted on Column1 in Step 1
for key, rows in groupby(csv.reader(open("ready.csv")),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But I'd really like to learn how to accomplish everything in a single .py file. I tried this:

# -*- coding: utf-8 -*-
# This processes a large CSV file.
# It will add a new column, populate the new column with a uniform piece of data on each row, sort the CSV, and remove the headers.
# Then it will split the single large CSV into multiple CSVs based on the value in column 0.
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
                         lambda row: row[0]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

but instead of working as intended, it's giving me multiple CSVs named after each column header.

Is that happening because I removed the header row when I used separate .py files, and I'm not doing that here? I'm not really certain what operation I need to perform to remove the header when splitting the files.

3 Answers

12

Why not just groupby Column1 and save each group?

df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)

   Column1  Column2       NewColumn
0        1    93644  NewColumnValue
5        1    19593  NewColumnValue
6        1    12707  NewColumnValue
1        2    63246  NewColumnValue
7        2    53480  NewColumnValue
2        3    47790  NewColumnValue
3        3    39644  NewColumnValue
4        3    32585  NewColumnValue

for i, g in df.groupby('Column1'):
    g.to_csv('{}.csv'.format(i), header=False, index=False)

Thanks to Unatiel for the improvement. header=False will not write the header row, and index=False will not write the index column.

This creates 3 files:

1.csv
2.csv
3.csv

Each having data corresponding to each Column1 group.
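Putting it together, a single-file version in the spirit of what the question asks for might look something like this (assuming the input is still named source.csv, as in the question):

# -*- coding: utf-8 -*-
import pandas as pd

df = pd.read_csv('source.csv')

# Add the constant column and sort on Column1
df['NewColumn'] = 'NewColumnValue'
df = df.sort_values('Column1')

# Write one headerless, index-free CSV per distinct Column1 value
for key, group in df.groupby('Column1'):
    group.to_csv('{}.csv'.format(key), header=False, index=False)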


4 Comments

Thanks. And to answer your question: when I spent today searching Google for a solution to the problem, pandas kept cropping up. I didn't realize there was another way :)
I tried this method and it's producing some odd output files. Taking my file 13.csv as an example, it produces a file that has headers, and additionally it seems to be inserting a headerless column with data I don't recognize, e.g.:

+------+---------+---------+----------------+
|      | Column1 | Column2 | NewColumn      |
| 6446 |      13 |   36457 | NewColumnValue |
+------+---------+---------+----------------+
You should try g.to_csv('{}.csv'.format(i), header=False, index=False). header=False won't write the headers and index=False won't write that headerless column (it is the index of your dataframe).
Excellent! I added 'index=False' and it worked like a charm. Thanks cᴏʟᴅsᴘᴇᴇᴅ and @Unatiel
2

You don't need to switch to itertools for the filtering; pandas has all of the necessary functionality built in.

# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')  # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique():  # for each unique value in Column1
    # These two steps can be combined into a single call;
    # they are kept separate here for clarity:
    # 1) filter the dataframe on the unique value
    dw = df[df['Column1'] == key]
    # 2) write the resulting dataframe without headers
    dw.to_csv("%s.csv" % key, header=False)
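As the comments in the snippet note, the two steps can be combined into a single call; and if you also want to drop the index column so the output matches the tables in the question, index=False does that. A condensed version of the loop above might look like:

for key in df['Column1'].unique():
    # filter and write in one call; index=False also drops the index column
    df[df['Column1'] == key].to_csv("%s.csv" % key, header=False, index=False)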

1 Comment

Thanks for answering. It gave me an error on line 12, but after I changed it to dw.to_csv("%s.csv" % key, header=False) it worked great.
1

pandas.DataFrame has a to_csv() method for writing its data out as a CSV, so you have no need for the csv module in this case.

import pandas as pd

df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
    df.loc[key].to_csv('%d.csv' % int(key), header=False)

for key in df.index.unique(): will loop over every unique value in the index. In your example, it will loop over (1, 2, 3). header=False will make sure the header isn't written to the output file.
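One caveat worth flagging as an aside: if some value occurs only once in the index, df.loc[key] returns a Series rather than a DataFrame, and Series.to_csv() writes a different shape (one line per column). Selecting with a list keeps the result a DataFrame; a small variation of the loop above:

for key in df.index.unique():
    # df.loc[[key]] (note the list) always returns a DataFrame,
    # even when only a single row matches the key
    df.loc[[key]].to_csv('%d.csv' % int(key), header=False)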

And to explain why you get the wrong output in your example: try print(list(df)). This should output all the column names in df, because iterating over a DataFrame yields its column labels. That is why your loop over csv.reader(df) iterates over the columns of df rather than over its rows.

Actually, you should get one CSV for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN] or maybe ,<itertools.... object at 0x.....>.
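To see that failure mode in isolation, here is a minimal sketch with a tiny made-up dataframe; it shows why iterating over a DataFrame (and therefore feeding it to csv.reader) yields column names rather than rows:

import csv
from itertools import groupby
import pandas as pd

# A tiny made-up frame, just to demonstrate the behaviour
df = pd.DataFrame({'Column1': [1, 2], 'Column2': [10, 20]})

print(list(df))  # ['Column1', 'Column2'] -- iteration yields column labels

# csv.reader accepts any iterable of strings, so it parses each column
# *name* as a one-field row, and groupby then keys on those names:
for key, rows in groupby(csv.reader(df), lambda row: row[0]):
    print(key)   # prints 'Column1', then 'Column2' -- hence one file per header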

3 Comments

Thanks for answering. However, I'm getting a syntax error on: df = df.sort_values('Column1').set_index('Column1')
I don't get any syntax error on this with Python 3. That being said, @cᴏʟᴅsᴘᴇᴇᴅ's answer is better; you should use it instead.
@SteveDallas Now that I think about it, a bare df = would indeed produce a syntax error. I think you may have added a newline by mistake after df =. Otherwise I don't really see where the syntax error would be in my code. I'm a bit tired right now though.
