To clean a string, I have to remove a substring from it, but the string contains special (non-ASCII) UTF-8 characters that the substring does not.

Example:

source = "Skoda"
to_be_clean = "Škoda Rapid"

I need to remove the string source from to_be_clean. The problem is that to_be_clean contains a special character ("Š") where source has a plain "S", so the two do not match. Is there a simple way to do this? Here is how I am doing it today:

output = to_be_clean.replace(source + ' ', '')

I was thinking about using a regular expression, but then I would need to list every possible special character.

It's really not clear what you want. Are you hoping to find a way to make "Škoda" equal to "Skoda" so that you can then remove it? There are many questions about removing accents from Unicode; have you googled those? Commented Feb 21, 2018 at 15:01

1 Answer

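The unicodedata module from the standard library should solve your problem: normalize the string to NFKD, then drop everything that is not ASCII.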

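import unicodedata

to_be_clean = "Škoda Rapid"

# NFKD splits "Š" into "S" plus a combining caron. Encoding to ASCII with
# errors="ignore" then drops the caron (and any other non-ASCII character).
ascii_text = unicodedata.normalize("NFKD", to_be_clean).encode("ascii", "ignore").decode("ascii")
print(ascii_text)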

Output:

Skoda Rapid
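
To get from there to the result the question asks for, the normalized string can be passed to the same replace call. The following is only a sketch, assuming Python 3. The helper name strip_accents is hypothetical and not part of unicodedata. Unlike the ASCII encode above, it filters out combining marks only, so characters with no decomposition (for example "ø" or "ß") are kept rather than silently dropped.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks (accents, carons, ...) removed.

    Hypothetical helper for illustration; not part of the standard library.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


source = "Skoda"
to_be_clean = "Škoda Rapid"

# Normalize first so that "Škoda" matches the plain-ASCII source string.
normalized = strip_accents(to_be_clean)
output = normalized.replace(source + " ", "")
print(output)  # Rapid
```

If source itself may contain accented characters, it would probably make sense to run it through the same helper before calling replace.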

1 Comment

Thanks, exactly what I was looking for. I was actually not aware of the unicodedata module.
