
I have a CSV file with multiple categorical columns, but most of these columns contain messy data due to typing mistakes (e.g., 'spciulated', 'SPICULATED', etc. for the category 'spiculated' of the column 'margins'). Is there a standard way to deal with such situations?

To be precise, I would like to read the CSV file directly into a clean DataFrame with a dtype category for the categorical columns, but with all variants collapsed into one category (e.g., each variant of 'spiculated' would be read as 'spiculated'). The spelling variants could be given by a dict, for instance.

Expected solution:

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins

where test.csv is:

margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed

to obtain:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']

However, without the spelling variants information, I get:

0       spiculated
1      spiiculated
2       SPICULATED
3    circumscribed
4     cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']

My current solution looks like this:

df2 = pd.read_csv('test.csv')

for feat, feat_vals in FEAT_VALS.items():
    for enc_val, str_vals in feat_vals.items():
        df2.loc[df2[feat].isin(str_vals), feat] = enc_val

df2.margins = df2.margins.astype('category')
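The collapsing can also be pushed into read_csv itself via its converters parameter, which is the closest thing to "somehow give FEAT_VALS to read_csv". A sketch, assuming the same FEAT_VALS dict (note that converters and dtype cannot target the same column, so the category cast happens afterwards; io.StringIO stands in for test.csv here):

```python
import io
import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# One flat {variant: canonical} mapping per column; the default argument
# binds each column's own mapping at definition time.
CONVERTERS = {
    feat: lambda v, m={s: c for c, ss in vals.items() for s in ss}: m.get(v, v)
    for feat, vals in FEAT_VALS.items()
}

# Stands in for test.csv from the question.
csv_data = io.StringIO(
    "margins\nspiculated\nspiiculated\nSPICULATED\ncircumscribed\ncicumscribed\n"
)

df = pd.read_csv(csv_data, converters=CONVERTERS)
df["margins"] = df["margins"].astype("category")
```

Unknown variants pass through unchanged (m.get(v, v)), so a typo not listed in FEAT_VALS still becomes its own category rather than being silently dropped.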
Comments:

  • Please provide a minimal reproducible example, with your input and expected output: stackoverflow.com/help/minimal-reproducible-example Commented Jul 15, 2024 at 17:53
  • Do you know the correct categories beforehand? If not then you'll have to use a clustering algorithm instead of just a Levenshtein distance. Commented Jul 15, 2024 at 18:13
  • @Fravadona Yes. Commented Jul 15, 2024 at 18:31
  • @AlekFröhlich We can't come up with any answer if you don't provide some sample lines of the CSV file (simplified is better) plus the expected output you want to get out of it. Commented Jul 16, 2024 at 11:56
  • @Fravadona Updated accordingly. Commented Jul 16, 2024 at 12:32

2 Answers


You could flip the inner dicts and then use .map(). This isn't as direct as you'd like, but at least the Pandas code stays clean.

for feat, feat_vals in FEAT_VALS.items():
    feat_strs = {
        str_val: enc_val
        for enc_val, str_vals in feat_vals.items()
        for str_val in str_vals
    }
    df[feat] = df[feat].map(feat_strs).astype('category')

    # For demo
    print(df[feat].cat.categories)

Output:

Index(['circumscribed', 'spiculated'], dtype='object')
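One caveat with this approach: .map() returns NaN for any value that is not in the flipped dict, so variants missing from FEAT_VALS are silently lost. A sketch of a guard against that (the series and mapping here are illustrative, not from the question):

```python
import pandas as pd

s = pd.Series(["spiculated", "SPICULATED", "unknown-typo"])
feat_strs = {"spiculated": "spiculated", "SPICULATED": "spiculated"}

# Fall back to the original value when a variant is not in the mapping,
# instead of letting .map() produce NaN.
cleaned = s.map(feat_strs).fillna(s).astype("category")
```

Unmapped values then survive as their own categories, which makes them easy to spot and add to the dict later.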

1 Comment

I see, thank you!

There are three things to consider in your question:

  1. Make the strings comparable (e.g. remove accents and normalize the case).
    What is the best way to remove accents (normalize) in a Python unicode string?

  2. Select a fitting distance algorithm/library.
    looking for python library which can perform levenshtein/other edit distance at word-level

  3. Create a DataFrame from the transformed input.
    Create a pandas DataFrame from generator?


Here's a sample implementation that uses the Levenshtein distance for comparisons; now you just need to provide the expected categories (i.e. no FEAT_VALS anymore). Also, because the rows are transformed lazily, the memory footprint is kept as low as possible.

import csv
import unicodedata
import Levenshtein
import pandas as pd

def normalize_string(s):
    # Strip accents (NFD decomposition, drop combining marks) and lowercase.
    return ''.join(
        char for char in unicodedata.normalize('NFD', s)
        if unicodedata.category(char) != 'Mn'
    ).lower()

def parse_csv(filename, categories=("spiculated", "circumscribed")):
    with open(filename) as f:
        reader = csv.DictReader(f)
        normalized_categories = [normalize_string(c) for c in categories]
        for row in reader:
            min_distance = float('inf')
            normalized_margins = normalize_string(row["margins"])
            for i, normalized_category in enumerate(normalized_categories):
                distance = Levenshtein.distance(
                    normalized_margins,
                    normalized_category
                )
                if distance < min_distance:
                    min_distance = distance
                    row["margins"] = categories[i]
            yield row

df = pd.DataFrame(data=parse_csv("test.csv"), dtype='category')

print(df.margins)

Output:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
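If installing the third-party Levenshtein package is not an option, the standard library's difflib can approximate the same nearest-category matching. A sketch under that assumption (nearest_category is a hypothetical helper, not part of the answer above):

```python
import difflib

def nearest_category(value, categories=("spiculated", "circumscribed")):
    # Case-insensitive fuzzy match against the known categories;
    # falls back to the raw value if nothing is similar enough.
    matches = difflib.get_close_matches(value.lower(), categories, n=1, cutoff=0.6)
    return matches[0] if matches else value
```

difflib's similarity ratio is not the same metric as the Levenshtein distance, so for borderline typos the two approaches can disagree; for close variants like the ones in the question they pick the same category.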

1 Comment

Very nice idea, thanks!
