The data file is too big for RAM, so I can't use .read_csv() -> concat -> .to_csv(). Is there an easy way to concatenate two DataFrames?

  • Have you considered using generators? stackoverflow.com/questions/18915941/… Commented May 30, 2021 at 12:25
  • read_csv() has parameters like iterator and chunksize to help with reading big files. Check those out. Commented May 30, 2021 at 12:41
  • If you just need to append files, you can read the individual files and append them to one output using mode="a" with to_csv. Commented May 30, 2021 at 12:42
  • If your dataset exceeds memory, you should try Dask, which lets you work with large datasets and integrates well with Python libraries like NumPy, scikit-learn, etc. (see the sketch below). More info: Dask and pandas: There’s No Such Thing as Too Much Data Commented May 30, 2021 at 12:52
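Following up on the Dask suggestion above, here is a minimal, untested sketch of what that could look like, assuming the input files are named file1.csv, file2.csv, and file3.csv and share the same columns:

import dask.dataframe as dd

# Dask reads the CSVs lazily as partitioned blocks, so the full
# dataset is never loaded into RAM at once.
df = dd.read_csv(['file1.csv', 'file2.csv', 'file3.csv'])

# single_file=True writes one combined CSV instead of one file per partition
df.to_csv('new.csv', single_file=True, index=False)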

1 Answer


The idea is to read a batch of n rows (small enough to fit in RAM) from each CSV file and append it to a new CSV file. Note that all files must have the same column schema.

The code below seems to work on my small CSV files. You could try it on larger ones with a bigger batch size and let me know if it works.

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']
batch_size = 2

# Write the header row once, taken from the first file
df = pd.read_csv(filenames[0], nrows=0)
df.to_csv('new.csv', index=False)

for filename in filenames:
    this_batch = batch_size
    i = 0
    while this_batch == batch_size:
        # skiprows=range(1, ...) skips the rows already read but keeps
        # line 0, so pandas still parses the header of each batch correctly
        df = pd.read_csv(filename, nrows=batch_size,
                         skiprows=range(1, batch_size * i + 1))
        this_batch = len(df)
        i += 1
        df.to_csv('new.csv', mode='a', index=False, header=None)
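As the comments on the question point out, pandas can also do this batching itself via the chunksize parameter of read_csv(), which avoids the manual skiprows bookkeeping. A minimal sketch under the same assumptions (same filenames, identical column schema):

import pandas as pd

filenames = ['file1.csv', 'file2.csv', 'file3.csv']

# Write the header once, then append data rows chunk by chunk
pd.read_csv(filenames[0], nrows=0).to_csv('new.csv', index=False)

for filename in filenames:
    # chunksize makes read_csv return an iterator of DataFrames,
    # so only one chunk is held in memory at a time
    for chunk in pd.read_csv(filename, chunksize=10_000):
        chunk.to_csv('new.csv', mode='a', index=False, header=False)

Each chunk is parsed with the file's own header, so the columns line up automatically; header=False keeps that header from being repeated in the output.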