
I want to read a CSV file in Python, choosing the number of rows to skip dynamically based on a condition.

Condition: start reading from the first row that has 6 columns, or from the row where the column names match the sequence of those 6 columns.

File.csv

Col1,col2,col3
1,2,3
13,u,u
,,,
,,,
Col1,col2,col3,col4
1,2,3,4
13,u,u,y
,,,
,,,
Col1,col2,col3,col4,col5,col6
1,2,3,4,5,6
qw,ers,hh,yj,df,ji

Right now I'm reading this file using pandas.read_csv().

I know that the required columns start at row 10, so I hard-code that:

pandas.read_csv("file.csv", skiprows=10, header=None)

I want to work out the number of rows to skip dynamically: skip rows until a row has 6 columns, or until a row matches the sequence col1,col2,col3,col4,col5,col6.

start =  df.loc[df.FILE-START == 'col1,col2,col3,col4,col5,col6'].index[0]
df = pd.read_csv(filename, skiprows = start + 1)

I tried this, but it's not working.

    Why does the csv file contain different numbers of columns? That seems like a bad mistake. Commented Apr 1, 2023 at 18:37

3 Answers


Update

A more robust version using csv module:

import pandas as pd
import csv
import io

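with open('File.csv') as fp:
    while True:
        pos = fp.tell()  # remember where this line starts
        # parse the single line with csv so quoted commas are handled
        reader = csv.reader(io.StringIO(fp.readline()))
        row = next(reader)
        if len(row) == 6:
            break
    fp.seek(pos)  # rewind to the start of the 6-column header line
    df = pd.read_csv(fp)  # pandas reads from the current position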
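
If the header names are known in advance, the same loop can match on them instead of on the field count. The following is a minimal sketch under that assumption: read_from_header and EXPECTED are hypothetical names, the sample text stands in for File.csv, and the comparison is case-insensitive because the sample file writes Col1 with a capital C.

```python
import csv
import io

import pandas as pd

# Hypothetical constant; the expected names are taken from the question.
EXPECTED = ["col1", "col2", "col3", "col4", "col5", "col6"]


def read_from_header(fp, expected=EXPECTED):
    """Read a DataFrame starting at the first line whose fields match `expected`."""
    while True:
        pos = fp.tell()
        line = fp.readline()
        if not line:
            # End of file reached without finding the header
            raise ValueError("no matching header row found")
        row = next(csv.reader([line]), [])
        if [field.strip().lower() for field in row] == expected:
            break
    fp.seek(pos)  # rewind so pandas sees the header line first
    return pd.read_csv(fp)


# Stand-in for File.csv so the sketch runs on its own
sample = (
    "Col1,col2,col3\n1,2,3\n,,,\n"
    "Col1,col2,col3,col4,col5,col6\n1,2,3,4,5,6\nqw,ers,hh,yj,df,ji\n"
)
df = read_from_header(io.StringIO(sample))
print(df)
```

Unlike the loop above, this version raises an error when no matching row exists instead of running past the end of the file.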

Old answer

You can read the file line by line until you find a line with 6 columns, that is, 5 commas. Take care if you have quoted fields that contain commas, but this is fine for a simple csv file:

import pandas as pd

with open('File.csv') as fp:
    while True:
        pos = fp.tell()
        row = fp.readline()
        if row.count(',') == 5:
            break
    fp.seek(pos)
    df = pd.read_csv(fp)

Output:

>>> df
  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji
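
To illustrate the caveat about quotes, here is a small sketch with a made-up line (not taken from the question's file) where counting commas and parsing with the csv module disagree:

```python
import csv

# Hypothetical line: five fields, the first one contains a quoted comma
line = '"Smith, John",2,3,4,5\n'

comma_count = line.count(",")      # also counts the comma inside the quotes
fields = next(csv.reader([line]))  # the csv module honours the quoting

print(comma_count)  # 5, so the comma test would treat this as a 6-column row
print(len(fields))  # 5, the real number of fields
```

This is why the csv-based version is the safer choice when fields may be quoted.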

3 Comments

Your solution is great, Corralien, but I think Keshav is asking about using "Col1,col2,col3,col4,col5,col6" itself as an input.
I'm not really sure what is meant by "by skipping rows when we 6 cols or either in this sequence col1,col2,col3,col4,col5,col6".
Thanks, Corralien, for the quick help.

Another option with pandas' DataFrame constructor :

import csv
import pandas as pd

with open("file.csv") as csv_file:
    csv_reader = csv.reader(csv_file)
    rows = [row for row in csv_reader if len(row) == 6]
    data_six = {"columns": rows[0], "data": rows[1:]}
    df = pd.DataFrame(**data_six)

As explained by @Corralien, with this approach pandas loses the ability to infer data types for each column, since csv.reader always returns a list of strings.

csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called — file objects and list objects are both suitable. Each row read from the csv file is returned as a list of strings.

Source : [docs.python]

Output :

print(df)

  Col1 col2 col3 col4 col5 col6
0    1    2    3    4    5    6
1   qw  ers   hh   yj   df   ji

Note: this assumes that your csv file always ends with the six-column data and that this block has a single header row.
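
One possible way to get type inference back, sketched here as an assumption rather than as part of the original answer, is to write the filtered rows back out as CSV text and let pd.read_csv parse that. The sample data below is made up so that the first column is numeric.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for file.csv, with a numeric first column
sample = (
    "Col1,col2,col3\n1,2,3\n"
    "Col1,col2,col3,col4,col5,col6\n1,a,b,c,d,e\n2,f,g,h,i,j\n"
)

# Keep only the six-field rows, as in the answer above
with io.StringIO(sample) as csv_file:
    rows = [row for row in csv.reader(csv_file) if len(row) == 6]

# Write the kept rows back out as CSV text, then let read_csv infer dtypes
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
df = pd.read_csv(buffer)

print(df.dtypes)
```

The trade-off is that the kept rows are serialised and parsed a second time in memory, which is usually acceptable for files of moderate size.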

6 Comments

Hi. The problem with this method is that you lose pandas' ability to infer data types, but it's a good solution too. +1
Hi Corralien, can you explain why, please? What if we chain it with convert_dtypes?
csv.reader doesn't cast to int or float; you only get strings. So if you have a numeric column, it will be set to object. Yes, it's probably a good idea to use convert_dtypes. I agree with that.
I updated the csv file so we get numeric-like values in the first column, and when using convert_dtypes, print(df.dtypes) returns string for all the columns, dunno why!
To be honest, I can never get convert_dtypes to work properly. Maybe you can ask @mozway. Try convert_dtypes(convert_string=False).

You can use the approach as follows:

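import csv
import pandas as pd

filename = 'File.csv'
expected = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

def check_num_or_colseq(row):
    # True for a six-field row or for the expected header names (case-insensitive)
    return len(row) == 6 or [c.strip().lower() for c in row] == expected

# First pass: find the index of the first matching row
with open(filename, newline='') as file:
    readervar = csv.reader(file)
    for i, row in enumerate(readervar):
        if check_num_or_colseq(row):
            skip = i
            break

# Second pass: skiprows=skip keeps the matching row as the header
df = pd.read_csv(filename, skiprows=skip)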

I think all of the code above is self-explanatory. Hope this helps.

1 Comment

You read the file twice, which matters if the file is large, but your solution works well too. +1
