Pandas: how to skip n rows from the end of a CSV file when using the chunksize option for reading the CSV
I need to process a large CSV file (~2 GB). Because of a low memory limit, I am using the chunksize option to load one piece of the CSV into memory at a time rather than loading the entire file. I need to identify the last chunk of the CSV and skip n rows from that chunk. At this point I am not sure how to implement this. Any help is appreciated. Thanks in advance!
1 Answer
To do this, you need to know the total number of rows in the file, then divide it by the chunk size to identify the last chunk. Unfortunately, that means scanning the entire file once to get the count. However, you can do it with very little memory:
# Count the lines in the file using constant memory
i = -1
with open(file) as f:
    for i, l in enumerate(f):
        pass
At the end of this loop, i + 1 is the number of lines in the file (subtract one for the header to get the number of data rows). Using Dask to process big files in parallel is another option, but I am not sure whether you can get the last chunk of the file that way.
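To tie it together, here is a minimal sketch (not part of the original answer) of how the line count could be combined with read_csv's chunksize option to drop the last n rows. The names file, n, chunk_size, and process() are placeholders, and exactly one header line is assumed:

import pandas as pd

n = 10               # rows to drop from the end (placeholder value)
chunk_size = 100000  # rows per chunk (placeholder value)

# Pass 1: count lines with constant memory
with open(file) as f:
    total_lines = sum(1 for _ in f)
total_rows = total_lines - 1        # assumes exactly one header line

rows_to_keep = total_rows - n
rows_seen = 0

# Pass 2: read in chunks, truncating once the cutoff is reached
for chunk in pd.read_csv(file, chunksize=chunk_size):
    remaining = rows_to_keep - rows_seen
    if remaining <= 0:
        break
    if len(chunk) > remaining:
        chunk = chunk.iloc[:remaining]   # this is the last chunk we need
    rows_seen += len(chunk)
    process(chunk)                       # placeholder for per-chunk work

Note that counting lines only matches the row count if no field contains embedded newlines; quoted multi-line fields would make the first pass overestimate the number of rows.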
2 Comments
Anvita
Thanks @Amin Gheibi, it seems this is the only way. I searched a lot and pandas does not provide a way to directly get the last chunk.
Amin Gheibi
Since this information is not stored in the file's metadata, unfortunately any solution is input-sensitive: at a minimum you have to count how many '\n' characters the file contains.