3

I have a large file imported into a single DataFrame in pandas. I'm using pandas to split that file into many smaller files, segmenting by the number of rows in the DataFrame.

E.g., with 10 rows: file 1 gets rows 0 to 4, file 2 gets rows 5 to 9.

Is there a way to do this without having to create more dataframes?

6
  • split by what kind of rule ? Commented Nov 21, 2017 at 20:14
  • thanks for the catch. I've updated the question with that detail Commented Nov 21, 2017 at 20:16
  • 1
    df.iloc[0:4,:].to_csv(path) and just iterate over that... Commented Nov 21, 2017 at 20:17
  • 2
    df.iloc[:4,:] and df.iloc[5:,:] Commented Nov 21, 2017 at 20:18
  • What's the reason why this has to be done in pandas? From the current description (large file, being split by rows) you could do it from the command line using 'split'. Commented Nov 21, 2017 at 20:18

4 Answers 4

4

Assign a new column g; you just need to specify how many items you want in each group. Here I am using 3. Note that df.index//3 relies on the default 0..n-1 integer index.

df.assign(g=df.index//3)
Out[324]: 
    0  g
0   1  0
1   2  0
2   3  0
3   4  1
4   5  1
5   6  1
6   7  2
7   8  2
8   9  2
9  10  3

Then assign the result back (df = df.assign(g=df.index//3)) and call df[df.g==1] to get the rows you need.
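
The same grouping idea can be sketched without adding a column, assuming a fixed number of rows per chunk. The function name split_by_row_count and the "value" column are made up for illustration; the slices returned by groupby are still small DataFrame objects, so this only avoids modifying the original frame.

```python
import numpy as np
import pandas as pd


def split_by_row_count(df, rows_per_chunk):
    """Return a list of row slices of df, each holding at most rows_per_chunk rows."""
    # Use row positions (0, 1, 2, ...) rather than index labels, so any index works.
    group_ids = np.arange(len(df)) // rows_per_chunk
    return [chunk for _, chunk in df.groupby(group_ids)]


df = pd.DataFrame({"value": range(1, 11)})
chunks = split_by_row_count(df, 3)
print([len(c) for c in chunks])  # [3, 3, 3, 1]
```

Each chunk could then be written out with its own to_csv call inside a loop.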

1 Comment

do we really need that new column? df[np.arange(len(df))//3==1]
4

There are two ways of doing this. I believe you are looking for the former. Basically, we open a series of csv writers, then we write to the correct csv writer by using some basic math with the index, then we close all files.

A single DataFrame evenly divided into N number of CSV files

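import pandas as pd
import csv

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows, 1 column
NUMBER_OF_SPLITS = 2
# one open file and one csv writer per output file
fileOpens = [open(f"out{i}.csv","w",newline="") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
# use the row position (not the index label) so this works with any index;
# integer arithmetic avoids floating-point rounding when picking the writer
for pos,(_,row) in enumerate(df.iterrows()):
    fileWriters[pos*NUMBER_OF_SPLITS//len(df)].writerow(row.tolist())
for file in fileOpens:
    file.close()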
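
The "basic math with the index" in the loop above decides which writer receives each row. A minimal sketch of that mapping in integer form, with a hypothetical helper name file_number, shows how 10 rows land in 2 files:

```python
def file_number(position, total_rows, number_of_splits):
    """Map a 0-based row position to the 0-based number of the file that receives it."""
    return position * number_of_splits // total_rows


assignments = [file_number(pos, 10, 2) for pos in range(10)]
print(assignments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```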

A single DataFrame split into N smaller DataFrames, each written to its own CSV file

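import pandas as pd
import numpy as np

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows, 1 column
NUMBER_OF_SPLITS = 2
# array_split returns a list of smaller DataFrames, one per output file
for i, new_df in enumerate(np.array_split(df,NUMBER_OF_SPLITS)):
    # let to_csv open and write the file itself; writing its string output through
    # a text-mode file handle can double the line endings and leave blank rows
    new_df.to_csv(f"out{i}.csv")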

3 Comments

This solution forces the creation of a new df.
@billyc59 Updated it.
Why do you use the file write method in combination with df.to_csv()? The .to_csv() method can already write data to a file. In your case, I will get empty rows in the new CSVs.
2

Use numpy.array_split to split your DataFrame dfX and save it as N CSV files of nearly equal size (row counts differ by at most one when N does not divide the number of rows): dfX_1.csv to dfX_N.csv

import numpy as np

N = 10
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)
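
When the row count is not a multiple of N, array_split puts the extra rows in the leading parts. A small sketch on plain row positions (10 rows and 3 parts are arbitrary illustration values) shows the resulting sizes:

```python
import numpy as np

# Split the row positions 0..9 into 3 parts; 10 is not divisible by 3.
parts = np.array_split(np.arange(10), 3)
sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]
```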

0

Iterating over iloc's arguments will do the trick.
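
A minimal sketch of that idea, assuming a fixed number of rows per file. The helper name write_chunks, the "part" file prefix and the "value" column are illustrative only; each iloc slice is a temporary object that is written out immediately and never stored.

```python
import os
import tempfile

import pandas as pd


def write_chunks(df, rows_per_file, out_dir, prefix="part"):
    """Write consecutive row slices of df to CSV files and return their paths."""
    paths = []
    for file_no, start in enumerate(range(0, len(df), rows_per_file)):
        path = os.path.join(out_dir, f"{prefix}{file_no}.csv")
        # iloc slices by position; the end bound is exclusive and may run past the last row.
        df.iloc[start:start + rows_per_file].to_csv(path, index=False)
        paths.append(path)
    return paths


df = pd.DataFrame({"value": range(1, 11)})
with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(df, 5, tmp)
    row_counts = [len(pd.read_csv(p)) for p in paths]
print(row_counts)  # [5, 5]
```

The last file simply ends up shorter when the row count is not a multiple of rows_per_file.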
