Read a CSV file in Python, skipping rows until a certain number of columns is found

Question

I want to read a CSV file in Python and skip rows dynamically, based on a condition.

Condition: start reading at the first row that has 6 columns, or at the row whose column names are exactly those 6 columns in sequence.

File.csv

Col1,col2,col3

1,2,3

13,u,u

,,,

,,,

Col1,col2,col3,col4

1,2,3,4

13,u,u,y

,,,

,,,

Col1,col2,col3,col4,col5,col6

1,2,3,4,5,6

qw,ers,hh,yj,df,ji

Right now I'm reading this file using pandas.read_csv().

I know that the required columns are at the 10th row:

pandas.read_csv("file.csv", skiprows=10, header=None)

I want to do this dynamically: skip rows until a row has 6 columns, or until a row matches the sequence col1,col2,col3,col4,col5,col6.

start =  df.loc[df.FILE-START == 'col1,col2,col3,col4,col5,col6'].index[0]
df = pd.read_csv(filename, skiprows = start + 1)

I tried this, but it's not working.

Why does the CSV file contain different numbers of columns? That seems like a bad mistake.
– John Gordon, commented Apr 1, 2023 at 18:37

Corralien · Accepted Answer · 2023-04-01 18:48:21Z

2

Update

A more robust version using csv module:

import pandas as pd
import csv
import io

with open('File.csv') as fp:
    while True:
        pos = fp.tell()
        reader = csv.reader(io.StringIO(fp.readline()))
        row = next(reader)
        if len(row) == 6:
            break
    fp.seek(pos)
    df = pd.read_csv(fp)

Old answer

You can read the file line by line until you find a line with 6 columns, i.e. 5 commas. Take care if you have quoted fields that contain commas, but this is fine for a simple CSV file:

import pandas as pd

with open('File.csv') as fp:
    while True:
        pos = fp.tell()
        row = fp.readline()
        if row.count(',') == 5:
            break
    fp.seek(pos)
    df = pd.read_csv(fp)

Output:

>>> df
  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji
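The loops above assume that a six-column row always exists. As a minimal sketch, the csv-based version could be wrapped in a function that stops at end of file instead of failing on an exhausted reader. The name read_from_first_row_with and the n_cols parameter are hypothetical, not part of pandas or of the original answer:

```python
import csv
import io

import pandas as pd


def read_from_first_row_with(path, n_cols=6):
    """Hypothetical helper: load `path` starting at the first row with `n_cols` fields."""
    with open(path, newline="") as fp:
        while True:
            pos = fp.tell()          # remember where this line starts
            line = fp.readline()
            if not line:             # end of file reached without a match
                raise ValueError(f"no row with {n_cols} columns in {path}")
            # csv.reader copes with quoted commas; fall back to [] if it yields nothing
            row = next(csv.reader(io.StringIO(line)), [])
            if len(row) == n_cols:
                break
        fp.seek(pos)                 # rewind so pandas sees the matching row as the header
        return pd.read_csv(fp)
```

Calling read_from_first_row_with("File.csv") on the sample file should give the same two-row frame as shown in the output above. Like the original loop, this treats each physical line as one record, so it assumes no quoted field spans several lines.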

edited Apr 1, 2023 at 18:48

answered Apr 1, 2023 at 18:35

Corralien


3 Comments

Suchandra T Over a year ago

Your solution is great, Corralien, but I think Keshav is asking about using "Col1,col2,col3,col4,col5,col6" itself as the input.

Corralien Over a year ago

I'm not really sure: "by skipping rows when we 6 cols or either in this sequence col1,col2,col3,col4,col5,col6"

Keshav Over a year ago

Thanks, Corralien, for the quick help.

Timeless · Accepted Answer · 2023-04-01 19:41:52Z

2

Another option, using the pandas DataFrame constructor:

import csv
import pandas as pd

with open("file.csv") as csv_file:
    csv_reader = csv.reader(csv_file)
    rows = [row for row in csv_reader if len(row) == 6]
    data_six = {"columns": rows[0], "data": rows[1:]}
    df = pd.DataFrame(**data_six)

As explained by @Corralien, with this approach pandas loses the ability to infer the data type of each column, since csv.reader always returns a list of strings.

csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called; file objects and list objects are both suitable. Each row read from the csv file is returned as a list of strings.

Source: Python documentation for csv.reader

Output:

print(df)

  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji

Note: this assumes that your CSV file always ends with the six-column data and contains a single six-column header row.
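One possible way to keep this row-filtering approach and still get pandas' type inference is to write the six-column rows back into an in-memory buffer and hand that to pd.read_csv. This is only a sketch under the same assumptions as the note above; the helper name read_six_column_rows and its n_cols parameter are hypothetical:

```python
import csv
import io

import pandas as pd


def read_six_column_rows(path, n_cols=6):
    """Hypothetical helper: keep only rows with `n_cols` fields, then let pandas parse them."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    with open(path, newline="") as csv_file:
        for row in csv.reader(csv_file):
            if len(row) == n_cols:
                writer.writerow(row)   # re-serialize the row as CSV text
    buffer.seek(0)
    # read_csv infers dtypes itself, unlike the DataFrame constructor fed with strings
    return pd.read_csv(buffer)
```

With the sample file from the question, every column still comes out as object because the last row holds text. Numeric dtypes appear only when a column contains numbers alone.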

edited Apr 1, 2023 at 19:41

answered Apr 1, 2023 at 19:11

Timeless


6 Comments

Corralien Over a year ago

Hi. The problem with this method is that pandas loses the ability to infer data types, but it's a good solution too. +1

Timeless Over a year ago

Hi Corralien, can you explain why, please? What if we chain it with convert_dtypes?

Corralien Over a year ago

csv.reader doesn't cast to int or float; you get only strings. So if you have a numeric column, it will be set to object. Yes, it's probably a good idea to use convert_dtypes. I agree with that.

Timeless Over a year ago

I updated the CSV file so we get numeric-like values in the first column, and when using convert_dtypes, print(df.dtypes) returns string for all the columns. I don't know why!

Corralien Over a year ago

To be honest, I can never get convert_dtypes to work properly. Maybe you can ask @mozway. Try convert_dtypes(convert_string=False).


Suchandra T · Accepted Answer · 2023-04-01 18:42:42Z

1

You can use the approach as follows:

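import csv
import pandas as pd

filename = "File.csv"
expected = ["col1", "col2", "col3", "col4", "col5", "col6"]

def check_num_or_colseq(row):
    # True for a row with six fields, or one whose first six names match (case-insensitive)
    return len(row) == 6 or [c.strip().lower() for c in row[:6]] == expected

skip = None
# first pass: find the index of the header row
with open(filename, newline="") as file:
    for i, row in enumerate(csv.reader(file)):
        if check_num_or_colseq(row):
            skip = i
            break

# second pass: skiprows=skip drops the lines before the header and keeps the header itself
df = pd.read_csv(filename, skiprows=skip)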

I think all of the code above is self-explanatory. Hope this helps.
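If the header should be located by its names alone, as the question's second condition suggests, a name-only match might look like the sketch below. The helper names find_header_line and read_from_header are hypothetical, and the comparison is case-insensitive because the sample file spells the first column "Col1". It assumes no quoted field spans several lines before the header, so the row index equals the physical line index:

```python
import csv

import pandas as pd

EXPECTED = ["col1", "col2", "col3", "col4", "col5", "col6"]  # header names from the question


def find_header_line(path, expected=EXPECTED):
    """Hypothetical helper: 0-based index of the first row whose names match `expected`."""
    wanted = [name.lower() for name in expected]
    with open(path, newline="") as fp:
        for i, row in enumerate(csv.reader(fp)):
            if [cell.strip().lower() for cell in row] == wanted:
                return i
    raise ValueError("header row not found")


def read_from_header(path, expected=EXPECTED):
    # skiprows=i drops the i lines before the header and keeps the header itself
    return pd.read_csv(path, skiprows=find_header_line(path, expected))
```

Unlike a pure column count, this skips any earlier six-field row that is not the header, at the cost of reading the file twice.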

answered Apr 1, 2023 at 18:42

Suchandra T


1 Comment

Corralien Over a year ago

You read the file twice, which matters especially if the file is large, but your solution works well too. +1
