
I am trying to parse some CSV files with a few thousand rows. The CSV is structured with a comma as the delimiter and quotation marks enclosing each field. I want to use pd.read_csv() to parse the data without skipping the faulty lines via the on_bad_lines='skip' argument.

Example:

"Authors","Title","ID","URN"
"Overflow, Stack, Doe, John Society of overflowers "Stack" (something) ","50 years of overflowing in "Stack" : 50 years of stacking---",""117348359","URN:ND_649C52T1A9K1JJ51"
"Tyson, Mike","Me boxing","525166266","URN:SD_95125N2N3523N5KB"
"Smith, Garry",""Funny name" : a book about names","951992851","URN:ND_95N5J2N352BV525N25"
"The club of clubs","My beloved, the "overflow stacker"","9551236651","URN:SD_955K2B61J346F1N25"

I have tried to illustrate the problematic structure of the CSV file. In the example above, only the second data row would get parsed without problems; the others would fail because of quotation marks or commas enclosed within the field bounds.

When I run the script with the following command:

df = pd.read_csv(path, engine='python', delimiter=",", quotechar='"', encoding="utf-8", on_bad_lines='warn')

I get warnings on the problematic lines: ParserWarning: Skipping line 1477: ',' expected after '"'. I get similar errors when I try the default engine or pyarrow.

Now, what I want to accomplish is to pass a handler function to the on_bad_lines= argument that skips a line it cannot parse but, at the same time, stores the values that were found in a new variable (a dictionary or a list) so I can manually add that data later in the code. I tried the following function and passed it as the value for on_bad_lines=, but the bad_lines list just ended up empty anyway:

bad_lines = []
def handle_bad_lines(bad_line):
    print(f"Skipping bad line: {bad_line}")
    bad_lines.append(bad_line)  # keep the fields for manual handling later
    return None  # returning None should tell pandas to skip the line

Thank you.

4 Answers


I wonder if you can fix your bad CSV first, then run the good CSV through Pandas; don't try to fix it in Pandas (wrong tool for the job).

I don't know how representative your sample CSV really is, but it looks like someone hand-rolled their own CSV encoder by doing something like enclosing every item in a list with double quotes, then joining those elements with commas:

data = [
    ["amy bob", "i"],
    ["cam, deb", "ii"],
    ['"el" fay', "iii"],
]

for row in data:
    quoted = [f'"{x}"' for x in row]
    print(",".join(quoted))

which prints:

"amy bob","i"
"cam, deb","ii"
""el" fay","iii"

Given your sample, would the original data look something like this?

[
    ['Authors',        'Title',                 'ID',         'URN'                      ],
    ['..."Stack"...',  '...in "Stack"...',      '"117348359', 'URN:ND_649C52T1A9K1JJ51'  ],
    ['Tyson, Mike',    'Me boxing',             '525166266',  'URN:SD_95125N2N3523N5KB'  ],
    ['Smith, Garry',   '"Funny name"...',       '951992851',  'URN:ND_95N5J2N352BV525N25'],
    ['The club of...', '..."overflow stacker"', '9551236651', 'URN:SD_955K2B61J346F1N25' ],
]

Not sure about the double quote in the ID of the first record, "117348359. I assume that's a typo you made typing up the sample.

If so, you might start by assuming that all lines:

  • have a beginning and ending double quote ("bookend quotes")
  • contain the sequence "," only between fields; you don't expect to see that sequence in the data itself

and then checking those assumptions:

N_COLS = 4

with open("input-bad.csv") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # guard against blank lines

        if line[0] != '"' or line[-1] != '"':
            print(f"line {i} doesn't have bookend quotes: {line}")

        if (n := line.count('","')) != N_COLS - 1:
            print(f"line {i} appears to have {n + 1} cols: {line}")

Adding a "bad record" at the end of your sample:

['Mr Baz', 'foo"," the bar', '99999999', 'URN:TX_77777777']
Mr Baz","foo"," the bar","99999999","URN:TX_77777777

would print:

line 6 doesn't have bookend quotes: Mr Baz","foo"," the bar","99999999","URN:TX_77777777
line 6 appears to have 5 cols:      Mr Baz","foo"," the bar","99999999","URN:TX_77777777

Hopefully that doesn't print out anything, or prints only a small number of records you could deal with (I don't know how you'd deal with them, though).
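If you also want those suspect lines in a variable for manual fixing later (as in your question), collecting them is a small extension of the same loop; a sketch under the same two assumptions:

N_COLS = 4
suspect_lines = []

with open("input-bad.csv") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # flag anything that breaks either assumption
        if line[0] != '"' or line[-1] != '"' or line.count('","') != N_COLS - 1:
            suspect_lines.append((i, line))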

If so, then you can fix the file and output good CSV:

import csv, sys


def process_line(line: str) -> str:
    """
    Strip surrounding whitespace, remove bookend quotes.
    """
    return line.strip()[1:-1]


writer = csv.writer(sys.stdout)

with open("input-bad.csv") as f:
    for line in f:
        line = process_line(line)
        fields = line.split('","')
        writer.writerow(fields)

Running that on your sample I get (with leading spaces for readability):

Authors,           Title,                        ID,            URN
"...""Stack""...", "...in ""Stack""...",         """117348359", URN:ND_649C52T1A9K1JJ51
"Tyson, Mike",     Me boxing,                    525166266,     URN:SD_95125N2N3523N5KB
"Smith, Garry",    """Funny name""...",          951992851,     URN:ND_95N5J2N352BV525N25
The club of...,    "...""overflow stacker""...", 9551236651,    URN:SD_955K2B61J346F1N25

Again, ""117348359 looks weird, but I'll leave that for you.


2 Comments

This answer gave me the most ideas for how to approach my problem, so thanks. I ended up making 2 functions that preprocess the original csv file, write the fixed version with \ as an escape character for double quotes and commas, and then read the new csv file without any problems with pandas. Not very pythonic, but it tackled exactly what I needed. Also yeah, the ""117348359 was my typo when trying to recreate the csv.
@MitarZečević, great, and thanks for letting me know how you worked it out! And without seeing the solution, I wouldn't worry about "Pythonic", that's just a judgement on code-style. I think what matters most is that you got it working. Please accept the answer (checkmark near the top of the answer) if you think it helped you solve the problem.

You should try to filter the file first, by looking for an odd number of quotes (") or counting the number of commas in each line, or by using any other method you prefer. After filtering you can do:

data = pd.read_csv('bad_data.csv', header=0, names=["Authors", "Title", "ID", "URN"], delimiter=',', quotechar='"', doublequote=False, encoding='utf-8')
print(data)
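For instance, a rough pre-filter along those lines; this is only a sketch (it assumes a "good" line has an even number of quotes and exactly three "," separators, and the filtered_data.csv name is illustrative):

good_lines = []
with open('bad_data.csv', encoding='utf-8') as f:
    for line in f:
        # keep lines whose quotes balance and whose separator count
        # matches the expected four columns
        if line.count('"') % 2 == 0 and line.count('","') == 3:
            good_lines.append(line)
        else:
            print(f'Filtered out: {line.strip()}')

with open('filtered_data.csv', 'w', encoding='utf-8') as f:
    f.writelines(good_lines)

The pd.read_csv() call above would then point at filtered_data.csv instead of bad_data.csv.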

Or you can try something like this code, which filters the lines, separates the bad ones and prints them, saves the good ones in a dictionary, and then creates the DataFrame at the end.

Either way you will need to create your own filter.

import pandas as pd
# Manual filter
with open('bad_data.csv', mode='r', encoding='utf-8') as f:
    data_csv = f.readlines()

dict_filtered = {"Authors":[], "Title":[], "ID": [], "URN": []}
for line in data_csv:
    if "Authors" not in line: # Skip header line
        # Some heuristic to recognize bad data
        # Example -> lines that split into more than 5 comma-separated pieces
        list_data = line.split(',')
        if len(list_data) > 5:
            print(f'Bad line: {line}')
        else:
            author = f"{list_data[0]}, {list_data[1]}"
            title = list_data[2]
            ID = list_data[3]
            URN = list_data[4]

            dict_filtered['Authors'].append(author)
            dict_filtered['Title'].append(title)
            dict_filtered['ID'].append(ID)
            dict_filtered['URN'].append(URN)

print("DATAFRAME")
df = pd.DataFrame(dict_filtered)
print(df)

2 Comments

don't do this: data_csv = f.readlines(); just iterate over the file object directly. Note, .readlines is an archaic part of the API; you should just use list(f) if you actually want .readlines(), although you don't in this case.
also, this doesn't correctly parse the CSV: you cannot just do line.split(',') (note, the OP's csv is supposed to have four fields). Your code assumes the first field always has exactly one comma, but that isn't the case; see this line: "Tyson, Mike","Me boxing","525166266","URN:SD_95125N2N3523N5KB". This is why you should always use the csv module to parse CSVs

So the problem here, as far as I understand it, is that the CSV file has some rows with issues (things like unmatched quotes or commas inside quoted fields) which break parsing when using pd.read_csv(). Instead of letting pd.read_csv() throw errors or skip those bad rows, you want to handle them manually. Here's a solution that I think serves your purpose.

import pandas as pd
import csv

data = []
bad_lines = []

with open("path.csv", mode="r", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter=",", quotechar='"')
    headers = next(reader)  # first row is the header

    for i, row in enumerate(reader, start=2):
        try:
            # row is already a list of parsed fields; count leftover quote chars
            if sum(field.count('"') for field in row) % 2 != 0:
                raise ValueError("Unmatched quotes in row")

            # a comma and a quote in the same field suggests unescaped commas
            if any(',' in field and '"' in field for field in row):
                raise ValueError("Commas inside quoted field.")

            data.append(row)  # append if no issues exist
    
        except Exception as e:
            # Store bad lines
            bad_lines.append((i, row, str(e)))

uncorrupted_df = pd.DataFrame(data, columns=headers)

# Print bad lines encountered
print("Following bad lines encountered:")
for line in bad_lines:
    print(f"Line {line[0]}: {line[1]} (Error: {line[2]})")

This approach will catch and handle malformed rows instead of letting read_csv() silently skip them or throw errors. I hope this works for you!



TLDR:

Looking at the pandas code for 1.4.1 (and, I assume, 1.4.0), there seems to be a bug that simply bypasses the callable specification. You can verify that by putting a breakpoint inside your handler and seeing that it is never called.

Apparently this has been resolved in newer (v2) pandas and the pyarrow engine.

So the solution is likely to upgrade to a newer version of pandas, or to patch your version.
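For reference, once you are on a version where the callable path works, usage looks like your original attempt; a sketch (a callable for on_bad_lines is only accepted by the python engine, and, from pandas 2.0, the pyarrow engine; the filename is illustrative):

import pandas as pd

bad_lines = []

def handle_bad_lines(bad_line):
    # pandas hands the callable the offending line as a list of strings;
    # returning None tells the parser to drop the line
    bad_lines.append(bad_line)
    return None

df = pd.read_csv("data.csv", engine="python", on_bad_lines=handle_bad_lines)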

Background:

In python_parser.py (around line 744) we find the method below, which works for the on_bad_lines "error" and "warn" conditions but fails to handle the callable path, and thus simply returns None in the event of a parse error. When this method returns None, the caller (_next_line(self)) just ignores that line.

I'm guessing here that this is because getting the actual data from line = next(self.data) is tricky: self.data is a csv.reader object, and its iterator throws without exposing the actual data.

    def _next_iter_line(self, row_num: int) -> list[Scalar] | None:
        """
        Wrapper around iterating through `self.data` (CSV source).

        When a CSV error is raised, we check for specific
        error messages that allow us to customize the
        error message displayed to the user.

        Parameters
        ----------
        row_num: int
            The row number of the line being parsed.
        """
        try:
            # assert for mypy, data is Iterator[str] or None, would error in next
            assert self.data is not None
            line = next(self.data)  ### <---- This throws with your data
            # for mypy
            assert isinstance(line, list)
            return line
        except csv.Error as e:
            if (
                self.on_bad_lines == self.BadLineHandleMethod.ERROR
                or self.on_bad_lines == self.BadLineHandleMethod.WARN
            ):
                msg = str(e)

                if "NULL byte" in msg or "line contains NUL" in msg:
                    msg = (
                        "NULL byte detected. This byte "
                        "cannot be processed in Python's "
                        "native csv library at the moment, "
                        "so please pass in engine='c' instead"
                    )

                if self.skipfooter > 0:
                    reason = (
                        "Error could possibly be due to "
                        "parsing errors in the skipped footer rows "
                        "(the skipfooter keyword is only applied "
                        "after Python's csv library has parsed "
                        "all rows)."
                    )
                    msg += ". " + reason

                self._alert_malformed(msg, row_num)
            return None

