
I need to process a large CSV file (~2 GB). Because of a low memory limit, I am using the chunksize option to load one piece of the CSV into memory at a time rather than loading the entire file. I need to identify the last chunk of the CSV and skip n rows from that chunk. At this point I am not sure how to implement this. Any help is appreciated. Thanks in advance!

1 Answer


To do this, you need to know the total number of rows in the file, then divide it by the chunk size to identify the last chunk. Unfortunately, finding that total means scanning the entire file once, but you can do the scan with a small, constant amount of memory:

# Count lines without loading the file into memory.
with open(file) as f:
    i = -1  # stays -1 if the file is empty
    for i, _ in enumerate(f):
        pass

At the end of this loop, i + 1 is the number of lines in the file. Using Dask to process big files in parallel is another option, but I am not sure whether it lets you get at the last chunk of the file.
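Putting the pieces together, here is a minimal sketch of how that line count could drive the chunked read so the last n rows are skipped. It assumes a single header row; the file name data.csv, the chunk size, and n = 10 are placeholder values:

import pandas as pd

path = "data.csv"    # placeholder file name
chunksize = 100_000  # rows per chunk; tune to the available memory
n = 10               # number of trailing rows to skip

# Count the data rows with constant memory (subtract 1 for the header).
with open(path) as f:
    total_rows = sum(1 for _ in f) - 1
rows_to_keep = total_rows - n

seen = 0
for chunk in pd.read_csv(path, chunksize=chunksize):
    if seen >= rows_to_keep:
        break  # everything from here on is part of the skipped tail
    if seen + len(chunk) > rows_to_keep:
        # This chunk straddles the cutoff: keep only its leading rows.
        chunk = chunk.iloc[: rows_to_keep - seen]
    seen += len(chunk)
    # ... process chunk here ...

Because only the final chunk straddles the cutoff, every other chunk passes through untouched, and the loop exits as soon as the cutoff is reached.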


2 Comments

Thanks @Amin Gheibi, it seems I need to do it this way; I searched a lot and pandas doesn't provide a way to get the last chunk directly.
Since this information isn't stored in the file's metadata, any solution is unfortunately input-sensitive: at a minimum, you have to count how many '\n' characters the file contains.
