The data file is too big for RAM, so I can't use .read_csv() -> concat -> .to_csv(). Is there an easy option to concat two DataFrames?
1 Answer
I have an idea to read a batch of n rows (within RAM limits) from each csv file, and write/append it to a new csv file. Note that all files must have the same column schema.
The code below seems to work on my small CSV files. You could try it on larger ones with a larger batch size and let me know if it works.
import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']
batch_size = 2

# Write the column header once, taken from the first file.
df = pd.read_csv(filenames[0], nrows=0)
df.to_csv('new.csv', index=False)

for filename in filenames:
    this_batch = batch_size
    i = 0
    # Keep reading batch_size rows at a time until a short (or empty)
    # batch signals the end of the file.
    while this_batch == batch_size:
        df = pd.read_csv(filename, nrows=batch_size, skiprows=batch_size * i)
        this_batch = len(df)
        i += 1
        df.to_csv('new.csv', mode='a', index=False, header=False)
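One thing to be aware of: with skiprows=batch_size*i, pandas still has to read and discard all earlier lines on every pass, so the total work grows roughly quadratically with file length. The chunked-reading approach mentioned in the comment below keeps a single pass over each file.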
You can use iterator and chunksize to help with reading in big files. Check out mode="a" for to_csv.
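A minimal sketch of that chunked approach, reusing the filenames and 'new.csv' output from the answer above (the chunk size of 100000 rows is just a placeholder; pick one that fits in your RAM):

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']

# Write the column header once, then stream each file in chunks and append.
pd.read_csv(filenames[0], nrows=0).to_csv('new.csv', index=False)

for filename in filenames:
    # chunksize makes read_csv return an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(filename, chunksize=100000):
        chunk.to_csv('new.csv', mode='a', index=False, header=False)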