13

I am aware of the skiprows parameter, which lets you pass a list of row indices to skip. However, I have the indices of the rows I want to keep.

Say that my csv file looks like this, for millions of rows:

  A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 0

The only indices I would like to load are 2 and 3, so

index_list = [2,3]

The value I would need to pass to skiprows is [0,1,4]. However, I only have [2,3] available.

I am trying something like:

pd.read_csv(path, skiprows = ~index_list)

but no luck. Any suggestions?

Thanks, I appreciate all the help.

2
  • Can you provide the exact code instead of a template? Commented Sep 6, 2016 at 0:57
  • @Sreejith hopefully it's more readable now. Commented Sep 6, 2016 at 1:20

4 Answers

20

You can pass in a lambda function in the skiprows argument. For example:

rows_to_keep = [2,3]
pd.read_csv(path, skiprows = lambda x: x not in rows_to_keep)

You can read more about it in the pandas read_csv documentation.
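
One caveat worth checking: skiprows counts physical lines of the file, and line 0 is normally the header, so a callable that rejects line 0 would promote the first kept row to the header. Below is a minimal sketch that keeps the header, assuming the file has a single header line and that rows_to_keep holds 0-based data-row positions as in the question. The helper name read_rows and the in-memory comma-separated sample are illustrative assumptions, not part of the original answer.

```python
import io

import pandas as pd


def read_rows(source, rows_to_keep):
    """Read only the given 0-based data rows, always keeping the header line."""
    keep = set(rows_to_keep)  # set membership is cheap for each line checked
    # File line 0 is the header; data row i sits on file line i + 1.
    return pd.read_csv(
        source,
        skiprows=lambda line: line != 0 and (line - 1) not in keep,
    )


# Illustrative data shaped like the question's example.
sample = "A,B\n1,2\n3,4\n5,6\n7,8\n9,0\n"
df = read_rows(io.StringIO(sample), [2, 3])
print(df)
```

The resulting frame is re-indexed from 0, so the original row positions are not preserved as labels.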


1 Comment

I did some testing and found that for the argument skiprows, passing a list is much faster than passing a lambda function. Passing a list appears to be O(1), whereas passing a lambda func is O(N). So for very large CSV files, I strongly recommend generating the list of rows to skip from a list of known rows to keep first, like gabra's answer. (Results as of pandas v1.4.1)
13

I think you would need to find the number of lines first, like this.

with open(path) as f:
    num_lines = sum(1 for _ in f)

Then build the list of rows to skip by excluding the indices in index_list:

to_exclude = [i for i in range(num_lines) if i not in index_list]

and then load your data:

pd.read_csv(path, skiprows = to_exclude)
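
If the file has a header on its first line, the skip list probably needs to leave line 0 alone and shift the wanted indices by one, since skiprows refers to file lines rather than DataFrame rows. The following is a sketch under that assumption; rows_to_skip is a hypothetical helper and the sample data is made up to mirror the question.

```python
import io

import pandas as pd


def rows_to_skip(num_lines, index_list):
    """Return the file line numbers to skip, never skipping the header (line 0)."""
    keep_lines = {i + 1 for i in index_list}  # data row i is file line i + 1
    return [line for line in range(1, num_lines) if line not in keep_lines]


# Illustrative data shaped like the question's example.
sample = "A,B\n1,2\n3,4\n5,6\n7,8\n9,0\n"
num_lines = sum(1 for _ in io.StringIO(sample))  # header plus five data lines
to_exclude = rows_to_skip(num_lines, [2, 3])
df = pd.read_csv(io.StringIO(sample), skiprows=to_exclude)
print(to_exclude)
print(df)
```

Using a set for the membership test keeps the list comprehension roughly linear in the number of lines, which matters when index_list itself is large.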

3 Comments

Thank you gabra, I figured I would have to do something like this. It seems odd that there is a skiprows option but none to read only certain rows.
@dleal I agree with you. This also relates to your question.
You'd need to write [i for i in range(num_lines) if i not in index_list], right? num_lines is not iterable; it is an integer.
1

Another simple solution could be to call .loc right after read_csv. Something like this:

index_to_keep = [2, 4]
pd.read_csv(path).loc[index_to_keep]

Note: This is a slower approach, because the entire file is first loaded into memory and only then are the selected rows extracted.
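
If memory is the main concern, one possible variation (a different technique from the one-shot .loc call above) is to stream the file with the chunksize parameter and filter each chunk as it arrives. This is only a sketch: read_rows_chunked is a hypothetical helper, the chunk size is arbitrary, and it assumes the default integer index so that row labels equal row positions.

```python
import io

import pandas as pd


def read_rows_chunked(source, index_to_keep, chunksize=100_000):
    """Stream the CSV in chunks and keep only the requested row labels."""
    wanted = set(index_to_keep)
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Each chunk continues the default integer index of the full file.
        selected = chunk[chunk.index.isin(wanted)]
        if not selected.empty:
            parts.append(selected)
    # Fall back to an empty frame if none of the requested rows exist.
    return pd.concat(parts) if parts else pd.DataFrame()


# Illustrative data shaped like the question's example; tiny chunks for the demo.
sample = "A,B\n1,2\n3,4\n5,6\n7,8\n9,0\n"
df = read_rows_chunked(io.StringIO(sample), [2, 4], chunksize=2)
print(df)
```

Only one chunk plus the selected rows is held in memory at a time, at the cost of still parsing every line of the file.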


0

This solution uses both the skiprows and nrows parameters of read_csv. I needed to read a range of lines from a Google Sheet while preserving the header on line 1 (as displayed in the sheet), which read_csv counts as line 0. Here is what I came up with. A Google Sheet can be read as a CSV file; mine looks like this:

1 Timestamp First Name  English Name    Family Name Country School
2 3/7/2024 16:16:32 Matthew     Chandra Indonesia   Beaumont College
3 3/7/2024 16:25:17 Ngan Ka Kevin   Leung   Hong Kong   Beaumont College
4 3/7/2024 16:27:32 Bryan       Hariadi Indonesia   Beaumont College
5 3/7/2024 16:35:01 Rebecca     Yu  China   Beaumont College
6 3/7/2024 16:51:52 Juan        Kim Korea   Beaumont College
7 3/7/2024 16:53:50 Takaaki Taka    Shirakawa   Japan   Beaumont College
8 3/7/2024 16:53:59 Tomoya      Imamura Japan   Beaumont College
9 3/7/2024 17:04:49 Aliz        Vo  Vietnam Freeborn College
10 3/14/2024 16:46:10   Shoma       Iriguchi    Japan   Freeborn College
11 3/14/2024 16:46:11   Jaseong     Kim Korea   Freeborn College
12 3/28/2024 16:10:41   Jin Xin     Li  China   Beaumont College
13 3/28/2024 16:14:44   Aoi     Hanaoka Japan   Beaumont College
14 3/28/2024 17:20:03   Bioni       Mandiri Indonesia   Beaumont College
15 3/28/2024 17:23:29   Chloe       Budryanoo   Indonesia   Beaumont College
16 4/4/2024 16:20:34    Leticia     Tirtokuncoro    Indonesia   Beaumont College

Here is the code:

import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 300)

from_line = 5
to_line = 12

nrows = to_line - from_line + 1
skiprows = list(range(1,from_line - 1))

print(f'skiprows = {skiprows}, nrows = {nrows}')

# Change to your Google sheet
file_name = "https://docs.google.com/spreadsheets/..."

df = pd.read_csv(file_name,  header=0, skip_blank_lines=False, skiprows=skiprows, nrows=nrows)
print(df)

And here is the result:

skiprows = [1, 2, 3], nrows = 8
            Timestamp First Name English Name Family Name  Country            School
0   3/7/2024 16:35:01    Rebecca          NaN          Yu    China  Beaumont College
1   3/7/2024 16:51:52       Juan          NaN         Kim    Korea  Beaumont College
2   3/7/2024 16:53:50    Takaaki         Taka   Shirakawa    Japan  Beaumont College
3   3/7/2024 16:53:59     Tomoya          NaN     Imamura    Japan  Beaumont College
4   3/7/2024 17:04:49       Aliz          NaN          Vo  Vietnam  Freeborn College
5  3/14/2024 16:46:10      Shoma          NaN    Iriguchi    Japan  Freeborn College
6  3/14/2024 16:46:11    Jaseong          NaN         Kim    Korea  Freeborn College
7  3/28/2024 16:10:41    Jin Xin          NaN          Li    China  Beaumont College
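
The same arithmetic can be wrapped in a small function and tried without a network connection. This is a sketch only: read_line_range is a hypothetical helper name, the sample data is invented, and line numbers are assumed to be 1-based with the header on line 1, as in the sheet above.

```python
import io

import pandas as pd


def read_line_range(source, from_line, to_line):
    """Read sheet lines from_line..to_line (1-based, inclusive), keeping the header on line 1."""
    if from_line < 2 or to_line < from_line:
        raise ValueError("expected 2 <= from_line <= to_line")
    # 0-based file lines between the header and the first wanted line.
    skiprows = list(range(1, from_line - 1))
    nrows = to_line - from_line + 1
    return pd.read_csv(
        source,
        header=0,
        skip_blank_lines=False,
        skiprows=skiprows,
        nrows=nrows,
    )


# Invented sample: line 1 is the header, lines 2..6 hold the data.
sample = "name,score\na,1\nb,2\nc,3\nd,4\ne,5\n"
df = read_line_range(io.StringIO(sample), from_line=3, to_line=5)
print(df)
```

Because nrows stops the parser early, lines after to_line are not loaded into the result.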
