
I have a CSV file with multiple categorical columns, but most of these columns contain messy data due to typing mistakes (e.g., 'spciulated', 'SPICULATED', etc. for the category 'spiculated' of the column 'margins'). Is there a standard way to deal with such situations?

To be precise, I would like to read the CSV file directly into a clean DataFrame with a dtype category for the categorical columns, but with all variants collapsed into one category (e.g., each variant of 'spiculated' would be read as 'spiculated'). The spelling variants could be given by a dict, for instance.

Expected solution:

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins

where test.csv is:

margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed

to obtain:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']

However, without the spelling variants information, I get:

0       spiculated
1      spiiculated
2       SPICULATED
3    circumscribed
4     cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']

My current solution looks like this:

df2 = pd.read_csv('test.csv')

for feat, feat_vals in FEAT_VALS.items():
    for enc_val, str_vals in feat_vals.items():
        df2.loc[df2[feat].isin(str_vals), feat] = enc_val

df2.margins = df2.margins.astype('category')
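The collapsing can also be pushed into read_csv itself via its converters parameter, which is the closest thing to "somehow give FEAT_VALS to read_csv". A sketch, assuming the same FEAT_VALS dict (note that converters and dtype cannot target the same column, so the category cast happens afterwards; io.StringIO stands in for test.csv here):

```python
import io
import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# One flat {variant: canonical} mapping per column; the default argument
# binds each column's own mapping at definition time.
CONVERTERS = {
    feat: lambda v, m={s: c for c, ss in vals.items() for s in ss}: m.get(v, v)
    for feat, vals in FEAT_VALS.items()
}

# Stands in for test.csv from the question.
csv_data = io.StringIO(
    "margins\nspiculated\nspiiculated\nSPICULATED\ncircumscribed\ncicumscribed\n"
)

df = pd.read_csv(csv_data, converters=CONVERTERS)
df["margins"] = df["margins"].astype("category")
```

Unknown variants pass through unchanged (m.get(v, v)), so a typo not listed in FEAT_VALS still becomes its own category rather than being silently dropped.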
Comments:

  • Please provide a minimal reproducible example, with your input and expected output: stackoverflow.com/help/minimal-reproducible-example Commented Jul 15, 2024 at 17:53
  • Do you know the correct categories beforehand? If not then you'll have to use a clustering algorithm instead of just a Levenshtein distance. Commented Jul 15, 2024 at 18:13
  • @Fravadona Yes. Commented Jul 15, 2024 at 18:31
  • @AlekFröhlich We can't come up with any answer if you don't provide some sample lines of the CSV file (simplified is better) plus the expected output you want to get out of it. Commented Jul 16, 2024 at 11:56
  • @Fravadona Updated accordingly. Commented Jul 16, 2024 at 12:32

2 Answers


You could flip the inner dicts and then use .map(). This isn't as direct as you'd like, but at least the Pandas code stays clean.

for feat, feat_vals in FEAT_VALS.items():
    feat_strs = {
        str_val: enc_val
        for enc_val, str_vals in feat_vals.items()
        for str_val in str_vals
    }
    df[feat] = df[feat].map(feat_strs).astype('category')

    # For demo
    print(df[feat].cat.categories)

Output:

Index(['circumscribed', 'spiculated'], dtype='object')
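One caveat with this approach: .map() returns NaN for any value that is not in the flipped dict, so variants missing from FEAT_VALS are silently lost. A sketch of a guard against that (the series and mapping here are illustrative, not from the question):

```python
import pandas as pd

s = pd.Series(["spiculated", "SPICULATED", "unknown-typo"])
feat_strs = {"spiculated": "spiculated", "SPICULATED": "spiculated"}

# Fall back to the original value when a variant is not in the mapping,
# instead of letting .map() produce NaN.
cleaned = s.map(feat_strs).fillna(s).astype("category")
```

Unmapped values then survive as their own categories, which makes them easy to spot and add to the dict later.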

1 Comment

I see, thank you!

There are three things to consider in your question:

  1. Make the strings comparable (e.g. remove accents and normalize the case).
    What is the best way to remove accents (normalize) in a Python unicode string?

  2. Select a fitting distance algorithm/library.
    looking for python library which can perform levenshtein/other edit distance at word-level

  3. Create a DataFrame from the transformed input.
    Create a pandas DataFrame from generator?


Here's a sample implementation that uses the Levenshtein distance for comparisons; now you just need to provide the expected categories (i.e. no FEAT_VALS anymore). Also, because the rows are transformed lazily, the memory footprint is kept as low as possible.

import csv
import unicodedata
import Levenshtein
import pandas as pd

def normalize_string(s):
    # Strip accents (NFD decomposition, drop combining marks) and lowercase.
    return ''.join(
        char for char in unicodedata.normalize('NFD', s)
        if unicodedata.category(char) != 'Mn'
    ).lower()

def parse_csv(filename, categories=("spiculated", "circumscribed")):
    with open(filename) as f:
        reader = csv.DictReader(f)
        normalized_categories = [normalize_string(c) for c in categories]
        for row in reader:
            min_distance = float('inf')
            normalized_margins = normalize_string(row["margins"])
            for i, normalized_category in enumerate(normalized_categories):
                distance = Levenshtein.distance(
                    normalized_margins,
                    normalized_category
                )
                if distance < min_distance:
                    min_distance = distance
                    row["margins"] = categories[i]
            yield row

df = pd.DataFrame(data=parse_csv("test.csv"), dtype='category')

print(df.margins)

Output:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
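If installing the third-party Levenshtein package is not an option, the standard library's difflib can approximate the same nearest-category matching. A sketch under that assumption (nearest_category is a hypothetical helper, not part of the answer above):

```python
import difflib

def nearest_category(value, categories=("spiculated", "circumscribed")):
    # Case-insensitive fuzzy match against the known categories;
    # falls back to the raw value if nothing is similar enough.
    matches = difflib.get_close_matches(value.lower(), categories, n=1, cutoff=0.6)
    return matches[0] if matches else value
```

difflib's similarity ratio is not the same metric as the Levenshtein distance, so for borderline typos the two approaches can disagree; for close variants like the ones in the question they pick the same category.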

1 Comment

Very nice idea, thanks!
