
I have a CSV from a system that has a load of rubbish at the top of the file, so the header row might be row 5 or even row 14, depending on how much gibberish the report puts out.

I used to use:

idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

to skip past the rows that had fewer than 3 columns; when it hit the column headers, of which there are 12, it would stop, and then I could use idx with skiprows when reading the CSV file.

The system has had an update, and someone thought it would be good to make the CSV file valid by adding 11 trailing commas after their gibberish to match the header count.

So now I have a CSV like:

sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product 

etc..
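Just to show what the reader sees now: every one of those junk lines parses into a full-width row, so the old length check trips on the very first line (a quick illustration, with the gibberish shortened):

import csv

junk = "sadjfhasdkljfhasd,,,,,,,,,,"   # one of the new padded junk lines
row = next(csv.reader([junk]))
print(len(row) > 2)                    # True - the junk row is now just as wide as the header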

I tried:

idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)

but I think that's a Pandas thing and it just fails.

Any ideas on how I can get to my header row?

CODE:

import csv
import pandas as pd
from tkinter.filedialog import askopenfilename

lm = pd.DataFrame()  # running DataFrame the new data gets appended to

lmf = askopenfilename(filetypes=(("CSV Files", "*.csv"), ("All Files", "*.*")))

# Section gets the row number where the headers start
with open(lmf, 'r') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)   # debugging
    input('hold')      # debugging pause
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

# Reopens the file, skipping everything above the header row
lmkcsv = pd.read_csv(lmf, skiprows=idx)
lm = lm.append(lmkcsv)
print(lm)
  • How do you determine the header row? First one where all values are non-empty? The one that starts with "time"? Commented Jan 9, 2019 at 18:36
  • What is this supposed to do? if row in (None, "") > 2? Is this the "pandas thing"? This is actually a chained comparison, which will get interpreted as row in (None, "") and (None, "") > 2, which will always be false. Commented Jan 9, 2019 at 18:43
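A quick demo of that chaining (row here stands in for what csv.reader yields, i.e. a list, so the first comparison is False and the chain short-circuits):

row = ['sadjfhasdkljfhasd', '', '']   # csv.reader always gives back a list
print(row in (None, "") > 2)          # False: equivalent to (row in (None, "")) and ((None, "") > 2)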

2 Answers


Since your CSV is now a valid file and you just want to filter out the rows that don't have a certain number of non-empty columns, you can do that in pandas directly.

import pandas as pd

minimum_cols_required = 3
lmkcsv = pd.read_csv(lmf)   # lmf is the path you already get from askopenfilename
# keep only rows with at least `minimum_cols_required` non-empty values
# (no inplace=True here - dropna(..., inplace=True) returns None)
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)

If your CSV data also has a lot of empty values that get caught by this threshold, then just slightly modify your original code:

idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)

I'm not sure in what case a None would be returned, so the set(row) should do. If some of your headers are duplicates as well, do this:

from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
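Either way, once idx is found it plugs straight back into the read from your question (a sketch; lmf is assumed to be the path from your askopenfilename call):

import csv
from collections import Counter
import pandas as pd

with open(lmf, 'r', newline='') as fin:
    csvreader = csv.reader(fin)
    # stop at the first row with more than 2 non-empty cells - that is the header row
    idx = next(i for i, row in enumerate(csvreader)
               if len(row) - Counter(row)[''] > 2)

lmkcsv = pd.read_csv(lmf, skiprows=idx)   # pandas now sees the real header first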

1 Comment

Thank you, I used the set() just in case, as I don't know what the dataset could look like on a regular basis.

How about erasing the starting lines by doing some logic, like checking how many ',' exist or looking for some word? Something like:

f = open("target.txt", "r+")            # open the file for reading and in-place rewriting
d = f.readlines()                       # grab every line first
f.seek(0)                               # rewind so the kept lines overwrite the old contents
for i in d:
    if "sadjfhasdkljfhasd" not in i:    # keep only lines that are not the known junk
        f.write(i)
f.truncate()                            # chop off whatever is left past the new end

f.close()

After that, read the file normally.
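For example, picking pandas back up as in the question (target.txt is just the placeholder filename from the snippet above):

import pandas as pd

# the junk lines are gone, so the first remaining line is the real header
lm = pd.read_csv("target.txt")
print(lm.head())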

1 Comment

So if there are N unrelated lines, OP will have to do N comparisons? What if the "values" in the lines are different each time? How would you catch that?
