Description
The Dart String class is a sequence of UTF-16 code units (a.k.a. 16-bit integers).
It has a runes getter which provides a way to iterate the string as code points. However, that is not sufficient to perform operations which treat the string content as human-readable Unicode text, because the unit of representation for that is an extended grapheme cluster, which can consist of more than one code point. The most traditional example is the string "e\u0301", which contains only one grapheme cluster (the U+0301 code point is the combining acute accent, which combines with the preceding e to designate the glyph é). More complicated examples include combined emoji sequences and country codes (flag emojis).
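To make the mismatch concrete, here is a small sketch using only the existing `length` and `runes` members of String. The helper names `codeUnitCount` and `codePointCount` are made up for illustration and are not part of any API.

```dart
// Hypothetical helpers, named here only for illustration.
int codeUnitCount(String text) => text.length; // UTF-16 code units
int codePointCount(String text) => text.runes.length; // Unicode code points

void main() {
  // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster.
  const accented = 'e\u0301';
  print(codeUnitCount(accented)); // 2
  print(codePointCount(accented)); // 2

  // Regional indicators D + K (a flag emoji): one grapheme cluster.
  const flag = '\u{1F1E9}\u{1F1F0}';
  print(codeUnitCount(flag)); // 4
  print(codePointCount(flag)); // 2
}
```

In both cases a reader sees a single character, yet neither count is 1.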
Users currently cannot work with strings at the grapheme cluster level.
This leads to tricky bugs where tests work for simple examples, but the program fails badly when it encounters real-life text.
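A typical instance of such a bug might look like the following sketch. `truncateByCodeUnits` is a hypothetical helper written for this example; it truncates with `substring`, which counts code units.

```dart
// Hypothetical helper: keeps at most maxLength UTF-16 code units.
String truncateByCodeUnits(String text, int maxLength) =>
    text.length <= maxLength ? text : text.substring(0, maxLength);

void main() {
  // Passes the "simple" test: plain ASCII is unaffected.
  print(truncateByCodeUnits('cafe', 4) == 'cafe'); // true

  // "café" written as 'e' + U+0301 is 5 code units but 4 visible characters.
  const decomposed = 'cafe\u0301';
  final cut = truncateByCodeUnits(decomposed, 4);

  // The accent is silently dropped, changing the word.
  print(cut == 'cafe'); // true
}
```

The same kind of cut applied to a flag emoji would leave an unpaired regional indicator or even half of a surrogate pair.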
The Dart String class should, at the very least, provide a way to iterate the string as a sequence of grapheme clusters. There should probably also be other operations on the grapheme cluster sequence, so users won't have to do everything manually. The exact operations and API will need to be designed.
It might also be useful to make some changes to the String class, or add other related functionality in separate libraries or packages.
I've collected a number of ideas, wishes and concerns about such changes in a document.
A minimal solution to this issue would be a graphemeClusters getter on String which provides an iterable over "grapheme clusters". We believe this to be practically possible, even when compiled to JavaScript.
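As a rough illustration of the shape such a getter could take, here is a sketch written as an extension. It is an assumption-laden toy, not the proposed design: it only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding code point, and does not implement the full Unicode grapheme cluster rules (so flags, emoji sequences, Hangul and so on are not handled). The extension name `GraphemeClusterSketch` is invented for this example.

```dart
// Toy sketch only: NOT the Unicode extended grapheme cluster algorithm.
extension GraphemeClusterSketch on String {
  /// Lazily yields substrings, grouping each code point with any
  /// immediately following combining marks in U+0300..U+036F.
  Iterable<String> get graphemeClusters sync* {
    final cluster = StringBuffer();
    for (final codePoint in runes) {
      final isCombiningMark = codePoint >= 0x0300 && codePoint <= 0x036F;
      // A non-combining code point starts a new cluster.
      if (cluster.isNotEmpty && !isCombiningMark) {
        yield cluster.toString();
        cluster.clear();
      }
      cluster.writeCharCode(codePoint);
    }
    if (cluster.isNotEmpty) yield cluster.toString();
  }
}

void main() {
  final clusters = 'cafe\u0301!'.graphemeClusters.toList();
  print(clusters.length); // 5, although the string has 6 code units
  print(clusters[3] == 'e\u0301'); // true
}
```

A real implementation would need the Unicode segmentation property data, which is where the concerns about compiled size on JavaScript come in.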