Pandas: how to skip n rows from the end of a CSV file when using the chunksize option for reading the CSV
I need to process a large CSV file (~2 GB). Because of a low memory limit, I am using the chunksize option to load one piece of the CSV into memory at a time rather than loading the entire file. I need to identify the last chunk of the CSV and skip n rows from that chunk. At this point I am not sure how to implement this. Any help is appreciated. Thanks in advance!
1 Answer
To do this, you need to know the total number of rows in the file, then divide it by the chunk size to identify the last chunk. Unfortunately, that means scanning the entire file once to get the count. However, you can do it with very little memory:
# Count the lines in the file using constant memory
i = -1
with open(file) as f:
    for i, l in enumerate(f):
        pass
At the end of this loop, i + 1 is the number of lines in the file (subtract one for the header to get the number of data rows). Using Dask to process big files in parallel is another option, but I am not sure whether you can get the last chunk of the file that way.
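To tie it together, here is a minimal sketch (not part of the original answer) of how the line count could be combined with read_csv's chunksize option to drop the last n rows. The names file, n, chunk_size, and process() are placeholders, and exactly one header line is assumed:

import pandas as pd

n = 10               # rows to drop from the end (placeholder value)
chunk_size = 100000  # rows per chunk (placeholder value)

# Pass 1: count lines with constant memory
with open(file) as f:
    total_lines = sum(1 for _ in f)
total_rows = total_lines - 1        # assumes exactly one header line

rows_to_keep = total_rows - n
rows_seen = 0

# Pass 2: read in chunks, truncating once the cutoff is reached
for chunk in pd.read_csv(file, chunksize=chunk_size):
    remaining = rows_to_keep - rows_seen
    if remaining <= 0:
        break
    if len(chunk) > remaining:
        chunk = chunk.iloc[:remaining]   # this is the last chunk we need
    rows_seen += len(chunk)
    process(chunk)                       # placeholder for per-chunk work

Note that counting lines only matches the row count if no field contains embedded newlines; quoted multi-line fields would make the first pass overestimate the number of rows.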
2 Comments
Anvita
Thanks @Amin Gheibi, it seems this is the only way. I searched a lot and pandas does not provide a way to directly get the last chunk.
Amin Gheibi
Since this information is not stored in the file's metadata, unfortunately any solution is input-sensitive: at a minimum you have to count how many '\n' characters the file contains.