Concrete JavaScript regular expression for accented characters (diacritics)

Question

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.

This was my original version, until I wanted to add diacritic support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:

Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):

var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/

This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

My other approach was to use the `.` character class, to have a simpler expression:

var regex = /^.+,\s.+$/;

This would match for just about anything, at least in the form of: something, something. That's alright I suppose...

The last approach, which I just found might be simpler...

/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/

It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:

The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.
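As a rough sketch of how the third pattern could be probed for gotchas, the snippet below runs it against a few edge cases. The sample inputs are hypothetical test data, not names from the original post.

```javascript
// Sketch: probing the \u00C0-\u017F pattern with a few edge cases.
// All sample inputs are made-up test data.
const rangePattern = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

const samples = [
  "Mu\u00F1oz, Jos\u00E9", // ñ (U+00F1) and é (U+00E9) are inside the range
  "\u0218enol, Ali",       // Ș (U+0218) lies beyond U+017F
  "O'Donnell, Chris",      // the apostrophe is not in the character class
  "a\u00D7b, c\u00F7d",    // × (U+00D7) and ÷ (U+00F7) sit inside the range
  "Jose\u0301, Ana",       // e + combining acute (U+0301), i.e. a decomposed é
];

const results = samples.map((s) => rangePattern.test(s));
console.log(results); // [ true, false, false, true, false ]
```

The range accepts two non-letters (× and ÷), rejects letters above U+017F, and rejects decomposed forms where the accent is a separate combining code point.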

Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about characters outside the Latin character set.

Which of these three approaches is most suited for the task? Or are there better solutions?

There seems to be no particular reason to use the more complicated regexps. The only catch with the simplest solution is that it will also match "something, something, something". You could use something like regex = /^[^,]+,\s[^,]+$/; to prevent that.
– Jongware, Commented Dec 19, 2013 at 20:53
At a glance, the first one won't match the common name "O'Donnell, Chris", nor compound last names with a hyphen, nor multiple last names (etc.). See Falsehoods Programmers Believe About Names for just about every possible pitfall.
– Jongware, Commented Dec 19, 2013 at 20:58
"the . atom matches anything except newlines" actually is quite exact :-)
– Bergi, Commented Dec 19, 2013 at 21:22
If it is possible for you to use an additional library you can have a look at my answer here
– stema, Commented Dec 19, 2013 at 21:40
Jongware, I actually just read that article while I was browsing SO for an answer to my question - I also completely forgot about hyphens and apostrophes and the like, I was more concerned with making it international first :P I'm glad you brought it up though! And Stema, I actually looked at that library and I avoid incorporating libraries because this is all on Google Apps Script - incorporating external libraries would be a nightmare, and I would only be using it (in this case) for one particular field... kind of overkill :P
– Chris Cirefice, Commented Dec 19, 2013 at 22:14

Peter Mortensen · Accepted Answer · 2022-02-04 16:41:12Z

529

The easier way to accept all accents is this:

[A-zÀ-ú] // accepts lowercase and uppercase characters (also lets through [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, but also including û ü ý þ ÿ (still includes [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including [ \ ] ^ _ ` × ÷

See Unicode Character Table for characters listed in numeric order.
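To see why the ranges differ, a short sketch can list what each one lets through. The variable names here are illustrative, not part of the answer.

```javascript
// Sketch: find the ASCII characters that [A-z] accepts but [A-Za-z] does not.
const wide = /^[A-z]$/;
const strict = /^[A-Za-z]$/;

const extras = [];
for (let code = 0x41; code <= 0x7a; code++) {
  const ch = String.fromCharCode(code);
  if (wide.test(ch) && !strict.test(ch)) extras.push(ch);
}
console.log(extras.join(" ")); // [ \ ] ^ _ `

// × (U+00D7) and ÷ (U+00F7) fall inside À-ÿ; splitting the range skips them.
const withSymbols = /^[A-Za-z\u00C0-\u00FF]+$/;
const lettersOnly = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;
console.log(withSymbols.test("\u00D7"), lettersOnly.test("\u00D7")); // true false
```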

edited Feb 4, 2022 at 16:41

Peter Mortensen

answered Nov 13, 2014 at 2:02

Maycow Moura

16 Comments

Pierre Henry Over a year ago

It works nicely, +1, but could you elaborate why it works ?

Angad Over a year ago

@PierreHenry the - defines a range, and this technique exploits the ordering of characters in the charset to define a continuous range, making for a super concise solution to the problem

jcuenod Over a year ago

won't this match underscores (and the other non-word characters between Z and a)?

Nate Over a year ago

This matches at least the characters [, ], ^, and \, none of which should be included.

pacoverflow Over a year ago

Reading the comments and seeing all the accented letters that aren't matched, and all the non-letters that are matched, it appears there is no good solution to this problem.

Chaim Leib Halbert · Accepted Answer · 2020-11-10 19:16:33Z

76

The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars

I added these Unicode blocks (\u00C0-\u024F includes three adjacent blocks at once):

\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional

Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.

[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷

If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
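The difference between the two ranges can be checked directly. The sketch below also includes a small helper, `rejectedCodePoints` (a hypothetical name, not from any library), that reports which code points a single-character class refuses.

```javascript
// Sketch: compare the narrower and the extended range on "Șenol".
const upTo017F = /^[a-zA-Z\u00C0-\u017F]+$/;
const upTo024F = /^[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF]+$/;

const commaBelow = "\u0218enol"; // Ș, LATIN CAPITAL LETTER S WITH COMMA BELOW
const cedilla = "\u015Eenol";    // Ş, LATIN CAPITAL LETTER S WITH CEDILLA

console.log(upTo017F.test(commaBelow)); // false
console.log(upTo024F.test(commaBelow)); // true
console.log(upTo017F.test(cedilla));    // true

// Hypothetical helper: list the code points a one-character pattern rejects.
function rejectedCodePoints(name, singleCharPattern) {
  return Array.from(name)
    .filter((ch) => !singleCharPattern.test(ch))
    .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(rejectedCodePoints(commaBelow, /^[a-zA-Z\u00C0-\u017F]$/)); // [ 'U+0218' ]
```

A helper like this makes it easier to decide which block to add when a real name fails validation.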

edited Nov 10, 2020 at 19:16

answered Aug 24, 2016 at 23:38

Chaim Leib Halbert

2 Comments

cprcrack Over a year ago

Having a look at the unicode table latin block, I think you should also include \u1e00-\u1eff, so I'm doing [a-zA-Z\u00c0-\u024f\u1e00-\u1eff]

Barnee Over a year ago

This is the same thing but with glyphs: [a-zA-ZÀ-ÖÙ-öù-ÿĀ-žḀ-ỿ0-9].

Peter Mortensen · Accepted Answer · 2022-02-04 16:24:55Z

28

/^[\pL\pM\p{Zs}.-]+$/u

Explanation:

\pL - matches any kind of letter from any language
\pM - matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Zs} - matches a whitespace character that is invisible, but does take up space
u - Pattern and subject strings are treated as UTF-8

Unlike other proposed regex (such as [A-Za-zÀ-ÖØ-öø-ÿ]), this will work with all language specific characters, e.g. Šš is matched by this rule, but not matched by others on this page.

Unfortunately, JavaScript does not support the short \pL form natively, and older runtimes lack Unicode property escapes altogether. However, you can use xregexp, e.g.

const XRegExp = require('xregexp');

const isInputRealHumanName = (input: string): boolean => {
  return XRegExp('^[\\pL\\pM-]+ [\\pL\\pM-]+$', 'u').test(input);
};
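For runtimes that do support Unicode property escapes (ES2018 or later), a native version can be sketched without any library. The braces around the property names and the `u` flag are both required; the variable names are illustrative.

```javascript
// Sketch assuming a runtime with Unicode property escapes (ES2018 or later).
// \p{L} = any letter, \p{M} = combining mark, \p{Zs} = space separator.
const anyScriptName = /^[\p{L}\p{M}\p{Zs}.-]+$/u;
// Variant restricted to the Latin script.
const latinName = /^[\p{Script=Latin}\p{M}\p{Zs}.-]+$/u;

console.log(anyScriptName.test("\u0160\u0161"));       // true  (Šš)
console.log(anyScriptName.test("Jean-Luc"));           // true
console.log(anyScriptName.test("\u65E5\u672C\u8A9E")); // true  (日本語)
console.log(latinName.test("\u65E5\u672C\u8A9E"));     // false
console.log(latinName.test("Jose\u0301"));             // true  (e + combining acute)
```

Including \p{M} matters for decomposed input, where the accent arrives as a separate combining code point.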

edited Feb 4, 2022 at 16:24

Peter Mortensen

answered Feb 19, 2020 at 9:51

Gajus

2 Comments

Ahmed Fasih Over a year ago

This should now work with all JS runtimes supporting Unicode property escapes! But you need to tweak it a bit, adding {} around L and M: /[\p{L}\p{M}\p{Zs}.-]+/gu. This matches Chinese characters as well, so if you want to only match Latin characters with accents, try /[\p{Script=Latin}\p{M}\p{Zs}.-]+/gu. For a large table of many useful character categories, check javascript.info/regexp-unicode

Louis-Rémi Over a year ago

@AhmedFasih your comment should be the accepted answer, as these escapes seem to now work in all major browsers…

dylankb · Accepted Answer · 2017-05-30 20:57:02Z

20

Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I see here is not diacritics but whitespace. Some names consist of multiple words, e.g. because of titles. So you should go with the most generic pattern, that is, allowing everything but the comma that separates the last name from the first:

/[^,]+,\s[^,]+/

But your second solution with the . character class is just as fine; you might only need to watch out for multiple commas then.
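A minimal sketch of the negated-class approach, with ^ and $ anchors added so that extra commas are rejected. The `parseName` helper and its return shape are assumptions for illustration.

```javascript
// Sketch: "anything but a comma", a comma, one whitespace, "anything but a comma".
// Anchored so that a second comma makes the whole input fail.
const lastFirst = /^([^,]+),\s([^,]+)$/;

// Hypothetical helper: returns the two parts, or null if the format is wrong.
function parseName(input) {
  const match = lastFirst.exec(input);
  return match ? { last: match[1], first: match[2] } : null;
}

console.log(parseName("van der Berg, Anne-Marie")); // { last: 'van der Berg', first: 'Anne-Marie' }
console.log(parseName("O'Donnell, Chris"));         // { last: "O'Donnell", first: 'Chris' }
console.log(parseName("a, b, c"));                  // null

// Without anchors the pattern finds a match inside the longer string.
console.log(/[^,]+,\s[^,]+/.test("a, b, c"));       // true
```

Because the classes are negated rather than enumerated, diacritics, hyphens, apostrophes, and internal spaces all pass without further changes.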

edited May 30, 2017 at 20:57

dylankb

answered Dec 19, 2013 at 21:40

Bergi

9 Comments

Chris Cirefice Over a year ago

Hm, maybe you're right. I probably over-complicated it... Could you explain the regex you provided? I've been working with regex for a little while now, but only basic stuff, and really I don't have a clue what yours actually does! Ha

Bergi Over a year ago

It's a negated character class - meaning "anything besides the comma".

Chris Cirefice Over a year ago

Ah, so it reads more like any_character_not_a_comma, any_character_not_a_comma? That's what I thought when I first read it, I got kind of confused when I saw three commas in there.

Bergi Over a year ago

Yes exactly. Sorry for the confusion with the missing s for the whitespace…

Bergi Over a year ago

@MateoTibaquirá You can simplify [^\s] to \S

Peter Mortensen · Accepted Answer · 2022-02-04 16:08:09Z

17

The XRegExp library has a plugin named Unicode that helps solve tasks like this.

<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
  var unicodeWord = XRegExp("^\\p{L}+$");

  unicodeWord.test("Русский"); // true
  unicodeWord.test("日本語"); // true
  unicodeWord.test("العربية"); // true
</script>

edited Feb 4, 2022 at 16:08

Peter Mortensen

answered Jan 10, 2015 at 15:50

thorn0

1 Comment

Chris Cirefice Over a year ago

Nice, turns out that I didn't actually need to regex on unicode, but rather on the pattern anything, anything. This will be useful for future readers :)

Peter Mortensen · Accepted Answer · 2022-02-04 16:23:45Z

13

You can use this:

/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/

edited Feb 4, 2022 at 16:23

Peter Mortensen

answered Jul 7, 2017 at 3:37

alchn

3 Comments

Gajus Over a year ago

Doesn't match Šš.

pacoverflow Over a year ago

@Gajus Then just put those 2 in the character class!

Gajus Over a year ago

@pacoverflow The concern is not whether Šš are matched specifically, but if they are not matched, then the question becomes what else is not matched.

Fawaz Ahmed · Accepted Answer · 2024-11-26 06:54:45Z

10

You can remove the diacritics from letters by using:

let str = "résumé"
let result = str.normalize('NFD').replace(/\p{Diacritic}/gu, '') // returns resume
console.log(result)

This removes all the diacritical marks; you can then run your regex on the result.
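Combined with the original ASCII-only pattern from the question, this could be sketched as follows. The function names are hypothetical, and the approach assumes a runtime that supports `\p{Diacritic}` with the `u` flag.

```javascript
// Sketch: strip combining marks first, then validate with the plain ASCII pattern.
function stripDiacritics(text) {
  // NFD splits "é" into "e" + U+0301; \p{Diacritic} then removes the mark.
  return text.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}

const asciiLastFirst = /^[a-zA-Z]+,\s[a-zA-Z]+$/;

function isLastFirst(input) {
  return asciiLastFirst.test(stripDiacritics(input));
}

console.log(stripDiacritics("r\u00E9sum\u00E9"));  // resume
console.log(isLastFirst("Mu\u00F1oz, Jos\u00E9")); // true
console.log(isLastFirst("S\u00F8rensen, Ana"));    // false: ø has no decomposition
```

Letters such as ø, ß, and æ are not a base letter plus a mark, so normalization leaves them unchanged and they still need to be allowed explicitly.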

Reference:

Searching and sorting text with diacritical marks in JavaScript

Unicode Character Class Escape

edited Nov 26, 2024 at 6:54

answered May 26, 2020 at 17:02

Fawaz Ahmed

1 Comment

bigsee Over a year ago

I know the OP was asking about regex, but this was a solid answer and solved the issue for me. See the current top-voted answer to the question here for a fuller explanation.

Peter Mortensen · Accepted Answer · 2022-02-04 16:24:13Z

9

You can use this:

^([a-zA-Z]|[à-ú]|[À-Ú])+$

It will match any word, whether or not it contains accented characters.

edited Feb 4, 2022 at 16:24

Peter Mortensen

answered Dec 5, 2018 at 11:56

Javier Pallarés

1 Comment

barbsan Over a year ago

But OP wants to allow accented characters.

Peter Mortensen · Accepted Answer · 2022-02-04 16:23:20Z

4

From Wikipedia: Basic Latin

For Latin letters, I use

/^[A-zÀ-ÖØ-öø-ÿ]+$/

It avoids hyphens and special characters.

edited Feb 4, 2022 at 16:23

Peter Mortensen

answered Apr 27, 2017 at 6:57

Phil

1 Comment

JLRishe Over a year ago

This matches [, \, ], ^, _, and `.

Neonnaut · Accepted Answer · 2025-01-17 16:54:44Z

0

The following regex will match every character with a diacritic in the Latin blocks, except for Latin Extended-C, D, E, F, and G, which are all medieval stuff.

/[\u00C0-\u00C5\u00C7-\u00CF\u00D1-\u00D6\u00D9-\u00DD\u00E0-\u00E5\u00E7-\u00EF\u00F1-\u00F6\u00F8-\u00FD\u00FF\u0100-\u0130\u0134-\u0137\u0139-\u0148\u014C-\u0151\u0154-\u017E\u0180-\u0183\u0187-\u0188\u018A-\u018C\u0191-\u0193\u0197-\u019B\u019D-\u01A1\u01A4-\u01A5\u01AB-\u01B0\u01B2-\u01B6\u01BA-\u01BB\u01BE\u01CD-\u01DC\u01DE-\u01F0\u01F4-\u01F5\u01F8-\u021B\u021E-\u0221\u0224-\u0236\u023A-\u0240\u0243\u0246-\u024F\u1E00-\u1E9D\u1EA0-\u1EF9\u1EFE-\u1EFF]/

answered Jan 17 at 16:54

Neonnaut

Peter Mortensen · Accepted Answer · 2022-02-04 16:29:55Z

-1

My context is slightly different and limited to French: I want to search text while tolerating wrong or missing accents.

For example, I want to find "maîtrisée", but the text to be searched is "... maitrisee ...". So, I used the regular expression /ma[i|î|ï]tris[e|é|è|ê|ë]/ in JavaScript.

In the expression, the '[' and ']' define a set of characters, and the '|' is an OR condition.

This page gives a list of accented characters: Diacritiques utilisés en français
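One detail worth checking with a quick sketch: inside square brackets, | is an ordinary character rather than an alternation operator, so the set above also accepts a literal pipe. The variable names are illustrative.

```javascript
// Sketch: inside [...] the | is a literal member of the set, not "OR".
const withPipes = /ma[i|î|ï]tris[e|é|è|ê|ë]/;
const withoutPipes = /ma[iîï]tris[eéèêë]/;

// Both find the unaccented and the accented spelling.
console.log(withPipes.test("... maitrisee ..."));    // true
console.log(withoutPipes.test("... maitrisee ...")); // true
console.log(withoutPipes.test("maîtrisée"));         // true

// Only the version with pipes accepts a literal | in those positions.
console.log(withPipes.test("ma|tris|"));    // true
console.log(withoutPipes.test("ma|tris|")); // false
```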

edited Feb 4, 2022 at 16:29

Peter Mortensen

answered Nov 27, 2021 at 21:15

Nicolas

2 Comments

TylerH Over a year ago

This is useful for anyone else matching the exact same word in the exact same language, but that's not what this question is about (and it's unlikely anyone else will share this extremely specific requirement of yours). Answers should directly address the question, not be orthogonally related, at best.

micapam Over a year ago

You don't need the | characters for your use case, you can just use /ma[iîï]tris[eéèêë]/
