I have a corpus of text which includes some accented words, such as épée, and I would like people to be able to search through it easily using an ASCII keyboard. Ideally, they would simply type protege or pinata to find protégé or piñata. The program is currently written in Python and uses only the built-in libraries, such as re.

I have looked at similar questions, such as Why does re not ignore accents, but the suggested solution is to normalize the Unicode string to ASCII. That could be made to work, but it seems inordinately ugly and doesn't return the actual text that should be displayed. Does Python not have anything analogous to POSIX character equivalence, which maps similar characters together based on the user's locale? For example, grep -E '[[=e=]][[=p=]][[=e=]][[=e=]]' matches both epee and épée (in the en_US.UTF-8 locale).

3 Replies

Well, then please define the rules for interpreting pinata as piñata, and the like. The problem is not really technical.

So far, your requirements look self-contradictory. Pinata and piñata, formally speaking, are just different words, no more, no less.

Also, I would ask you: why can't you trust users to just type the characters they actually use? People who know how to spell and read protégé or piñata should know how to enter them.

The question is unclear to begin with. There's no such thing as an "ASCII keyboard", and there's no example code, no regex patterns, and no expected versus actual outcome. Do you mean a US keyboard?

Is the real question how to perform an accent insensitive search?

Python 3 strings are Unicode, so there are no "ASCII" issues to resolve. It doesn't matter what the user's or terminal's encoding is; Python 3 strings are always Unicode. Whenever you see an encoding issue, it's because someone tried to read a file (or bytes) using the wrong codepage. Typically, that involves attempts to hard-code the 7-bit ASCII codepage.
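For example, a two-line illustration (the byte values are just what UTF-8 produces for this word):

data = "épée".encode("utf-8")   # b'\xc3\xa9p\xc3\xa9e'
data.decode("utf-8")            # 'épée' - the right codec round-trips cleanly
data.decode("ascii")            # raises UnicodeDecodeError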

Accents are another matter. The equivalence of characters and their order in a language is a collation issue. Danish uses the Latin-1 codepage, but AA at the start of a word is the same as Å. Unfortunately, regex implementations generally only deal with case sensitivity, not accents, and Python is no exception.

Form D (NFD) normalization works because it decomposes characters into base characters plus equivalent combining characters.

The output of this may look the same as "épée":

import unicodedata
unicodedata.normalize('NFD', "épée")

but the regex below returns 3 matches instead of 1. It works because é was replaced by two characters, a base e plus a combining acute accent, that together produce the correct glyph. Annoying, but that's the state of regular expressions:

re.findall("e",unicodedata.normalize('NFD',"épée"))
---------
['e', 'e', 'e']

re.findall("A",unicodedata.normalize('NFD',"Århus"))
---------
['A']
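Building on that, here is a minimal sketch of an accent- and case-insensitive word search that still returns the original text for display, which is what the question asks for. The names strip_accents and search are mine, and tokenizing on \w+ is an assumption about the corpus:

import re
import unicodedata

def strip_accents(s):
    # Decompose into base + combining characters, then drop the marks.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def search(query, text):
    folded = strip_accents(query).casefold()
    # Compare the folded forms, but return the words as originally written.
    return [word for word in re.findall(r'\w+', text)
            if strip_accents(word).casefold() == folded]

search('epee', 'The épée is a fencing weapon.')    # ['épée']
search('pinata', 'A piñata hung from the tree.')   # ['piñata']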

Databases

Databases, on the other hand, are aware of collations, since they affect matching and ordering. In most databases you can specify which collation to use at the column level, which means you can have case-insensitive and accent-insensitive (CI-AI) columns and index them. In some of them (not all) you can even create indexes with different collations.

If you load your data into a CI-AI column you can search for exact matches (WHERE product_name='epee') or do a prefix search (WHERE product_name LIKE 'ep%') and take advantage of indexes.
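If you want to stay within Python's standard library, sqlite3 can approximate this with a custom collation. A minimal sketch (the table and collation names are illustrative); note that SQLite's LIKE does not consult collations, so the indexed prefix search applies to other engines rather than SQLite:

import sqlite3
import unicodedata

def fold(s):
    # Case- and accent-insensitive key: strip combining marks, then casefold.
    nfd = unicodedata.normalize('NFD', s)
    return ''.join(c for c in nfd if not unicodedata.combining(c)).casefold()

def ci_ai(a, b):
    # SQLite collation functions return negative, zero, or positive, like cmp().
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(':memory:')
conn.create_collation('CI_AI', ci_ai)
conn.execute("CREATE TABLE products (product_name TEXT COLLATE CI_AI)")
conn.executemany("INSERT INTO products VALUES (?)", [('épée',), ('piñata',)])

conn.execute("SELECT product_name FROM products "
             "WHERE product_name = 'epee'").fetchall()   # [('épée',)]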


Keyboards

Keyboards don't know anything about codepages. The OS translates keystrokes to actual characters, and all OSs allow you to use multiple languages and switch between them. You can easily add a French or UK keyboard if you like. On a UK keyboard, AltGr+E (AltGr is the right Alt key) produces é. On a US keyboard, the right Alt acts as plain Alt only. The OS itself may have shortcuts.

In Windows Notepad I typed è, é, and ï with Ctrl+` e, Ctrl+' e, and Ctrl+; i respectively, but that doesn't work in a browser, and VS Code uses these as shortcuts. And there are certainly character selection utilities, online keyboards, etc.

There is no equivalence set notation in the re or regex modules; the POSIX notation [[=e=]] is unsupported. Likewise, the Unicode equivalents [\p{toNFD=/e/}] and [\p{toNFKD=/e/}] are not supported by the Python modules either.

The simplest approach would be (?i)[eé]\u0301?p[eé]\u0301?[eé]\u0301?. Or you can use named lists in the regex module, with a pattern like r'(?if)\L<e>\u0301?p\L<e>\u0301?\L<e>\u0301?' where e is ['e', 'é']. Both are sketched below.
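A sketch of both patterns; the second needs the third-party regex module, since re has no \L<name> support:

import re
import regex

text = "Épée, epee, and e\u0301pe\u0301e\u0301"

# Plain re: alternate the precomposed é with an optional combining
# acute accent (U+0301) after each e.
re.findall(r'(?i)[eé]\u0301?p[eé]\u0301?[eé]\u0301?', text)

# regex module: a named list supplies the equivalent characters.
regex.findall(r'(?if)\L<e>\u0301?p\L<e>\u0301?\L<e>\u0301?',
              text, e=['e', 'é'])

Both calls find all three spellings, including the fully decomposed one.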

For a full solution you also need it to support case insensitivity and canonical equivalence. That would rely on external libraries such as icu4c and its Python wrapper, PyICU.
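As a sketch of that route, assuming the PyICU package is installed: a collator set to primary strength compares characters while ignoring both case and accent differences:

import icu  # PyICU, the Python wrapper for icu4c

collator = icu.Collator.createInstance(icu.Locale('en_US'))
collator.setStrength(icu.Collator.PRIMARY)  # ignore case and accents

collator.compare('epee', 'épée')   # 0 means the strings are equivalent
collator.compare('cafe', 'CAFÉ')   # 0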

With case insensitivity in Python's re, it is important to understand that it uses simple case folding instead of full case folding, and has some differences hardwired in. The regex module can support either full or simple case folding, depending on which version (V0 or V1) is being used and which flags are included in the pattern.
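The classic demonstration is ß, which full-case-folds to ss; the second and third lines need the third-party regex module:

import re
import regex

re.fullmatch(r'(?i)strasse', 'straße')       # None: simple folding only
regex.fullmatch(r'(?if)strasse', 'straße')   # matches: f enables full folding
regex.fullmatch(r'(?iV1)strasse', 'straße')  # matches: V1 folds fully by default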

Take, for instance, the search prompt cafe against English-only material: it would need to match cafe, café, cafe\u0301, CAFE, CAFÉ, CAFE\u0301, Cafe, Café, and Cafe\u0301.
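That list can be checked against a named-list pattern in the style shown earlier (again with the third-party regex module):

import regex

variants = ['cafe', 'café', 'cafe\u0301', 'CAFE', 'CAFÉ', 'CAFE\u0301',
            'Cafe', 'Café', 'Cafe\u0301']
pattern = regex.compile(r'(?i)caf\L<e>\u0301?', e=['e', 'é'])
all(pattern.fullmatch(v) for v in variants)   # True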

In databases, this can be achieved with accent- and case-insensitive collations, and it is commonly done. It is the default in library management systems, for instance, and in many search engines.

If your data sets or corpora are large, then something like a Lucene/Solr stack might be useful.

I'm in the process of writing an article exploring some of these issues.
