
I have a CSV from a system that has a load of rubbish at the top of the file, so the header row might be row 5 or even row 14, depending on how much gibberish the report puts out.

I used to use:

idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

to skip past the rows that had fewer than 3 columns; when it hit the column headers, of which there are 12, it would stop, and then I could use idx with skiprows when reading the CSV file.

The system has had an update, and someone thought it would be good to make the CSV file valid by adding 11 trailing commas after their gibberish to match the header count.

So now I have a CSV like:

sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product 

etc..
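Just to show what the reader sees now: every one of those junk lines parses into a full-width row, so the old length check trips on the very first line (a quick illustration, with the gibberish shortened):

import csv

junk = "sadjfhasdkljfhasd,,,,,,,,,,"   # one of the new padded junk lines
row = next(csv.reader([junk]))
print(len(row) > 2)                    # True - the junk row is now just as wide as the header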

I tried:

idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)

but I think that's a Pandas thing and it just fails.

Any ideas on how I can get to my header row?

CODE:

import csv
import pandas as pd
from tkinter.filedialog import askopenfilename

lm = pd.DataFrame()  # running DataFrame the new data gets appended to

lmf = askopenfilename(filetypes=(("CSV Files", "*.csv"), ("All Files", "*.*")))

# Section gets the row number where the headers start
with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)   # debugging
    input('hold')      # debugging pause
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

# Reopens the file, skipping everything above the header row
lmkcsv = pd.read_csv(lmf, skiprows=idx)
lm = lm.append(lmkcsv)
print(lm)
  • How do you determine the header row? First one where all values are non-empty? The one that starts with "time"? Commented Jan 9, 2019 at 18:36
  • What is this supposed to do? if row in (None, "") > 2? Is this the "pandas thing"? This is actually a chained comparison, which will get interpreted as row in (None, "") and (None, "") > 2, which will always be false. Commented Jan 9, 2019 at 18:43
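A quick demo of that chaining (row here stands in for what csv.reader yields, i.e. a list, so the first comparison is False and the chain short-circuits):

row = ['sadjfhasdkljfhasd', '', '']   # csv.reader always gives back a list
print(row in (None, "") > 2)          # False: equivalent to (row in (None, "")) and ((None, "") > 2)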

2 Answers


Since your CSV is now a valid file and you just want to filter out the rows that don't have a certain number of non-empty columns, you can do that in pandas directly.

import pandas as pd

minimum_cols_required = 3
lmkcsv = pd.read_csv(lmf)   # lmf is the path you already get from askopenfilename
# keep only rows with at least `minimum_cols_required` non-empty values
# (no inplace=True here - dropna(..., inplace=True) returns None)
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)

If your CSV data also has a lot of empty values that get caught by this threshold, then just slightly modify your original code:

idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)

I'm not sure in what case a None would be returned, so the set(row) should do. If some of your headers are duplicates as well, do this:

from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
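Either way, once idx is found it plugs straight back into the read from your question (a sketch; lmf is assumed to be the path from your askopenfilename call):

import csv
from collections import Counter
import pandas as pd

with open(lmf, 'r', newline='') as fin:
    csvreader = csv.reader(fin)
    # stop at the first row with more than 2 non-empty cells - that is the header row
    idx = next(i for i, row in enumerate(csvreader)
               if len(row) - Counter(row)[''] > 2)

lmkcsv = pd.read_csv(lmf, skiprows=idx)   # pandas now sees the real header first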

1 Comment

Thank you, I used the set() just in case, as I don't know what the dataset could look like on a regular basis.

How about erasing the starting lines by doing some logic, like checking how many ',' exist or looking for some word? Something like:

f = open("target.txt", "r+")            # open the file for reading and in-place rewriting
d = f.readlines()                       # grab every line first
f.seek(0)                               # rewind so the kept lines overwrite the old contents
for i in d:
    if "sadjfhasdkljfhasd" not in i:    # keep only lines that are not the known junk
        f.write(i)
f.truncate()                            # chop off whatever is left past the new end

f.close()

After that, read the file normally.
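For example, picking pandas back up as in the question (target.txt is just the placeholder filename from the snippet above):

import pandas as pd

# the junk lines are gone, so the first remaining line is the real header
lm = pd.read_csv("target.txt")
print(lm.head())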

1 Comment

So if there are N unrelated lines, OP will have to do N comparisons? What if the "values" in the lines are different each time? How would you catch that?
