I am trying to parse some CSV files with a few thousand rows each. The files use a comma as the delimiter and quotation marks enclosing every field. I want to use pd.read_csv() to parse the data without silently dropping the faulty lines, as on_bad_lines='skip' would.
Example:
"Authors","Title","ID","URN"
"Overflow, Stack, Doe, John Society of overflowers "Stack" (something) ","50 years of overflowing in "Stack" : 50 years of stacking---",""117348359","URN:ND_649C52T1A9K1JJ51"
"Tyson, Mike","Me boxing","525166266","URN:SD_95125N2N3523N5KB"
"Smith, Garry",""Funny name" : a book about names","951992851","URN:ND_95N5J2N352BV525N25"
"The club of clubs","My beloved, the "overflow stacker"","9551236651","URN:SD_955K2B61J346F1N25"
I have tried to illustrate the problematic structure of the CSV file. In the example above, only the second data row ("Tyson, Mike") would parse without problems; the others fail because of unescaped quotation marks embedded inside the quoted fields (commas inside a correctly quoted field would be fine on their own).
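For anyone who wants to reproduce this, a snippet along these lines should recreate the sample file (the filename sample.csv is just a placeholder):

from pathlib import Path

sample = '''"Authors","Title","ID","URN"
"Overflow, Stack, Doe, John Society of overflowers "Stack" (something) ","50 years of overflowing in "Stack" : 50 years of stacking---",""117348359","URN:ND_649C52T1A9K1JJ51"
"Tyson, Mike","Me boxing","525166266","URN:SD_95125N2N3523N5KB"
"Smith, Garry",""Funny name" : a book about names","951992851","URN:ND_95N5J2N352BV525N25"
"The club of clubs","My beloved, the "overflow stacker"","9551236651","URN:SD_955K2B61J346F1N25"
'''
Path("sample.csv").write_text(sample, encoding="utf-8")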
When I run the script with the following call:
df = pd.read_csv(path, engine='python', delimiter=",", quotechar='"', encoding="utf-8", on_bad_lines='warn')
I get a warning for each problematic line:
ParserWarning: Skipping line 1477: ',' expected after '"'
I see similar errors when I try the default C engine or pyarrow.
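To at least find out which lines are affected, I can capture those warnings with the standard library warnings module (a sketch; path is a placeholder for my actual file):

import warnings
import pandas as pd

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(path, engine='python', delimiter=",", quotechar='"',
                     encoding="utf-8", on_bad_lines='warn')

# each recorded warning message names the skipped line number
skipped = [str(w.message) for w in caught]

That tells me which lines were skipped, but not their contents, which is why I want a handler instead.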
Now, what I want to accomplish is to pass a handler function to the on_bad_lines= argument that skips a line it cannot parse but, at the same time, stores the values that were found in a new variable (a dictionary or a list), so I can add that data back manually later in the code. I tried the following function and passed it as the value of on_bad_lines=, but the bad_lines list just ended up empty anyway:
bad_lines = []

def handle_bad_lines(bad_line):
    # bad_line arrives as the list of fields pandas managed to split out
    print(f"Skipping bad line: {bad_line}")
    bad_lines.append(bad_line)
    return None  # returning None tells pandas to drop the row
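For completeness, this is how I wire the handler into the call (path again points at my actual file); as far as I can tell from the pandas docs, a callable for on_bad_lines= is only supported by the python engine:

import pandas as pd

df = pd.read_csv(
    path,
    engine='python',
    delimiter=",",
    quotechar='"',
    encoding="utf-8",
    on_bad_lines=handle_bad_lines,  # handler defined above
)
print(bad_lines)  # prints [], even though the file clearly contains malformed rows

My suspicion is that the callable is only invoked for rows with an unexpected number of fields, and that the quoting errors shown above make pandas discard the row before the handler ever runs, but I have not been able to confirm this.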
Thank you.