pandas read_csv and keep only certain rows (python)

Question

I am aware of the skiprows parameter, which lets you pass a list with the indices of the rows to skip. However, what I have is the indices of the rows I want to keep.

Say that my csv file looks like this for millions of rows:

The only indices I would like to load are 2 and 3, so

index_list = [2,3]

The input for the skiprows parameter would then be [0,1,4]. However, all I have available is [2,3].

I am trying something like:

pd.read_csv(path, skiprows = ~index_list)

but no luck. Any suggestions?

Thanks, I appreciate all the help.

Can you provide the exact code instead of a template?

– Sreejith Menon, Commented Sep 6, 2016 at 0:57
@Sreejith hopefully it's more readable now.

– dleal, Commented Sep 6, 2016 at 1:20

wcyn · Accepted Answer · 2019-03-23 15:50:36Z

20

You can pass in a lambda function in the skiprows argument. For example:

rows_to_keep = [2,3]
pd.read_csv(path, skiprows = lambda x: x not in rows_to_keep)

You can read more about it in the pandas read_csv documentation.
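One thing to watch with the callable form: it is called with every line index, including 0, so the header line is skipped too unless it is let through explicitly. A minimal sketch, assuming the file has a header on line 0 and that the indices in rows_to_keep count file lines (header = 0). The sample data and the name csv_text are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "a,b\n10,11\n20,21\n30,31\n40,41\n"

# File line numbers to keep (the header is line 0); a set gives fast lookups.
rows_to_keep = {2, 3}

# Let line 0 (the header) and the wanted lines through; skip everything else.
df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i != 0 and i not in rows_to_keep,
)
print(df)
```

Dropping the `i != 0` condition gives the behaviour of the snippet above, where the first kept line ends up being treated as the header.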


1 Comment

mimocha Over a year ago

I did some testing and found that for the skiprows argument, passing a list is much faster than passing a lambda function. Passing a list appears to be O(1), whereas passing a lambda function is O(N). So for very large CSV files, I strongly recommend first generating the list of rows to skip from the list of known rows to keep, as in gabra's answer. (Results as of pandas v1.4.1)

Luke · Accepted Answer · 2025-01-09 20:43:12Z

13

I think you would need to find the number of lines first, like this.

with open(path) as f:
    num_lines = sum(1 for _ in f)

Then you would need to build the list of rows to skip, which is every row index that is not in index_list:

to_exclude = [i for i in range(num_lines) if i not in index_list]

and then load your data:

pd.read_csv(path, skiprows = to_exclude)
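Put together, the three steps might look like the sketch below. It assumes the file has a header on line 0 that should be kept, that index_list counts file lines with the header as line 0, and that no quoted field contains a line break (otherwise counting physical lines overcounts). The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file standing in for the real CSV.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n10,11\n20,21\n30,31\n40,41\n")
    path = tmp.name

index_list = [2, 3]  # file lines to keep (the header is line 0)

# Step 1: count the lines in the file.
with open(path) as f:
    num_lines = sum(1 for _ in f)

# Step 2: skip every line that is neither the header nor in index_list.
# A set keeps the membership test cheap for long lists.
keep = set(index_list) | {0}
to_exclude = [i for i in range(num_lines) if i not in keep]

# Step 3: load only the remaining lines.
df = pd.read_csv(path, skiprows=to_exclude)
print(to_exclude)
print(df)

os.remove(path)
```

Removing `| {0}` reproduces the answer's behaviour exactly, in which case line 0 is skipped as well and the first kept line is read as the header.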

answered Sep 6, 2016 at 2:14 by gabra, edited Jan 9 at 20:43 by Luke

3 Comments

dleal Over a year ago

Thank you gabra, I figured I would have to do something like this. It seems odd that there is a skiprows parameter but not one to read only certain rows.

gabra Over a year ago

@dleal I agree with you. This also relates to your question.

mfastudillo Over a year ago

You'd need to put [i for i in range(num_lines) if i not in index_list], right? num_lines is not iterable, it is an integer.

Divyanshu Srivastava · Accepted Answer · 2022-07-06 09:43:40Z

1

Another simple solution could be to call .loc right after read_csv. Something like this:

index_to_keep = [2, 4]
pd.read_csv(path).loc[index_to_keep]

Note: This is a slower approach, because the entire file is first loaded into memory and only then are the selected rows picked out.
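If memory is the concern, one possible middle ground is to read the file in chunks and filter each chunk as it arrives, so that only one chunk is held in memory at a time. This is a sketch rather than part of the answer above. It relies on the chunksize parameter of read_csv, and the sample data and chunk size are made up. As with .loc, the labels count data rows, so the first row after the header is 0.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "a,b\n10,11\n20,21\n30,31\n40,41\n50,51\n"
index_to_keep = [2, 4]  # data-row labels, as used with .loc above

# With chunksize set, read_csv yields DataFrames one chunk at a time, and the
# default integer index keeps counting across chunks.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    wanted = chunk[chunk.index.isin(index_to_keep)]
    if not wanted.empty:
        pieces.append(wanted)

# Assumes at least one requested row exists; pd.concat raises on an empty list.
df = pd.concat(pieces)
print(df)
```

The whole file is still parsed, so this trades memory for roughly the same parsing time.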


L Tyrone · Accepted Answer · 2024-04-07 05:50:15Z

This solution uses both the skiprows and nrows parameters of read_csv. I needed to read a range of lines from a Google sheet while preserving the header on line one (as displayed in the sheet), which is line 0 as far as read_csv is concerned. Here is what I came up with. Google sheets can be read as a csv file:

1 Timestamp First Name  English Name    Family Name Country School
2 3/7/2024 16:16:32 Matthew     Chandra Indonesia   Beaumont College
3 3/7/2024 16:25:17 Ngan Ka Kevin   Leung   Hong Kong   Beaumont College
4 3/7/2024 16:27:32 Bryan       Hariadi Indonesia   Beaumont College
5 3/7/2024 16:35:01 Rebecca     Yu  China   Beaumont College
6 3/7/2024 16:51:52 Juan        Kim Korea   Beaumont College
7 3/7/2024 16:53:50 Takaaki Taka    Shirakawa   Japan   Beaumont College
8 3/7/2024 16:53:59 Tomoya      Imamura Japan   Beaumont College
9 3/7/2024 17:04:49 Aliz        Vo  Vietnam Freeborn College
10 3/14/2024 16:46:10   Shoma       Iriguchi    Japan   Freeborn College
11 3/14/2024 16:46:11   Jaseong     Kim Korea   Freeborn College
12 3/28/2024 16:10:41   Jin Xin     Li  China   Beaumont College
13 3/28/2024 16:14:44   Aoi     Hanaoka Japan   Beaumont College
14 3/28/2024 17:20:03   Bioni       Mandiri Indonesia   Beaumont College
15 3/28/2024 17:23:29   Chloe       Budryanoo   Indonesia   Beaumont College
16 4/4/2024 16:20:34    Leticia     Tirtokuncoro    Indonesia   Beaumont College

Here is the code:

import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 300)

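from_line = 5  # first sheet line to read (1-based; line 1 is the header)
to_line = 12   # last sheet line to read (inclusive)

nrows = to_line - from_line + 1
# Skip the lines between the header and from_line (0-based file lines 1..from_line-2)
skiprows = list(range(1, from_line - 1))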

print(f'skiprows = {skiprows}, nrows = {nrows}')

# Change to your Google sheet
file_name = "https://docs.google.com/spreadsheets/..."

df = pd.read_csv(file_name,  header=0, skip_blank_lines=False, skiprows=skiprows, nrows=nrows)
print(df)

And here is the result:

skiprows = [1, 2, 3], nrows = 8
            Timestamp First Name English Name Family Name  Country            School
0   3/7/2024 16:35:01    Rebecca          NaN          Yu    China  Beaumont College
1   3/7/2024 16:51:52       Juan          NaN         Kim    Korea  Beaumont College
2   3/7/2024 16:53:50    Takaaki         Taka   Shirakawa    Japan  Beaumont College
3   3/7/2024 16:53:59     Tomoya          NaN     Imamura    Japan  Beaumont College
4   3/7/2024 17:04:49       Aliz          NaN          Vo  Vietnam  Freeborn College
5  3/14/2024 16:46:10      Shoma          NaN    Iriguchi    Japan  Freeborn College
6  3/14/2024 16:46:11    Jaseong          NaN         Kim    Korea  Freeborn College
7  3/28/2024 16:10:41    Jin Xin          NaN          Li    China  Beaumont College
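
The same skiprows/nrows arithmetic can be tried without a Google sheet. The sketch below wraps it in a helper and runs it on a small in-memory CSV. The helper name read_sheet_lines and the sample data are made up for illustration, and line numbers are 1-based with the header on line 1, as in the sheet above.

```python
import io

import pandas as pd


def read_sheet_lines(source, from_line, to_line):
    """Read sheet lines from_line..to_line (1-based, inclusive), keeping the header on line 1."""
    if from_line < 2 or to_line < from_line:
        raise ValueError("need 2 <= from_line <= to_line")
    nrows = to_line - from_line + 1
    # 0-based file lines that sit between the header (0) and the first wanted line.
    skiprows = list(range(1, from_line - 1))
    return pd.read_csv(
        source, header=0, skip_blank_lines=False, skiprows=skiprows, nrows=nrows
    )


# Hypothetical data: header on sheet line 1, then data on lines 2..7.
csv_text = "name,score\nA,1\nB,2\nC,3\nD,4\nE,5\nF,6\n"
df = read_sheet_lines(io.StringIO(csv_text), from_line=4, to_line=6)
print(df)
```

Here lines 4 to 6 of the sample correspond to the rows C, D and E.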
