Delphi 11 - controlling interaction between ASCII and UTF-8 in TStrings eg. in Memo

Question

If I copy and paste some UTF-8 text [eg. “Wands!”] into a TMemo, it displays as expected.

If I generate a string containing the 3 bytes (as characters) for '“' (i.e. 0xE2, 0x80, 0x9C) and use Memo1.Lines.Add(x), it displays as 'â' (0xE2 in extended ASCII), which it has stored as 0xC3, 0xA2 (UTF-8). The other two bytes of the string are stored as 0xC2, 0x80 and 0xC2, 0x9C.

How can I indicate that the string that I am adding already has UTF-8 multi-byte characters? And why is the string pasted into the Memo not treated the same way?

I am trying to process text extracted from ePub files. Originally the idea was to generate sort versions of text containing characters with diacritics by replacing them with the un-accented characters, but I ran into this problem of inconsistent displays.

Remy Lebeau · Accepted Answer · 2024-09-06 17:30:53Z

TMemo (and more generally, TStrings) works with Delphi's native string type only, which in Delphi 2009+ is a UTF-16 encoded UnicodeString.

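Since the Add() method expects a normal UTF-16 UnicodeString, you can't add UTF-8 encoded bytes using this method. Each byte you put into the string as a character is treated as its own UTF-16 character, which is why 0xE2 shows up as 'â' instead of being combined with the following two bytes into '“'.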
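To see where the bytes reported in the question come from, here is a minimal console sketch. It assumes the string was built by widening each raw byte to a character; the program name and the `Raw` constant are made up for illustration.

```delphi
program MojibakeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  // The UTF-8 encoding of U+201C (left double quotation mark).
  Raw: array[0..2] of Byte = ($E2, $80, $9C);

var
  S: string;
  Encoded: TBytes;
  I: Integer;
begin
  // Widen each byte to its own UTF-16 code unit: U+00E2, U+0080, U+009C.
  // This mimics putting the bytes "as characters" into a string.
  S := '';
  for I := Low(Raw) to High(Raw) do
    S := S + WideChar(Raw[I]);

  // Encoding those three characters as UTF-8 yields two bytes each.
  Encoded := TEncoding.UTF8.GetBytes(S);
  for I := 0 to High(Encoded) do
    Write(IntToHex(Encoded[I], 2), ' ');
  Writeln; // for this input: C3 A2 C2 80 C2 9C
end.
```

The six bytes match what the question reports, which suggests the bytes were never decoded as UTF-8 in the first place.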

If you have UTF-8 bytes in memory, you have to either:

decode the UTF-8 first, such as with TEncoding.UTF8.GetString(), eg:
```
Memo1.Lines.Add(TEncoding.UTF8.GetString(utf8Bytes));
```

put the UTF-8 bytes into a UTF8String, which the RTL can decode into a UnicodeString, eg:

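```
var utf8Str: UTF8String;
// copy the raw bytes into the UTF8String as-is, no conversion happens here
SetString(utf8Str, PAnsiChar(utf8Bytes), utf8Length);
// the cast to string makes the RTL convert UTF-8 to UTF-16
Memo1.Lines.Add(string(utf8Str));
```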
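Both options can be tried side by side in a small console program. This is only a sketch: the program and variable names are illustrative, and the three bytes are the ones from the question.

```delphi
program DecodeUtf8Demo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Utf8Bytes: TBytes;
  Utf8Str: UTF8String;
  ViaEncoding, ViaUtf8String: string;
begin
  // UTF-8 bytes for U+201C, as given in the question.
  Utf8Bytes := TBytes.Create($E2, $80, $9C);

  // Option 1: decode explicitly with TEncoding.
  ViaEncoding := TEncoding.UTF8.GetString(Utf8Bytes);

  // Option 2: copy the bytes into a UTF8String and let the RTL convert.
  SetString(Utf8Str, PAnsiChar(Utf8Bytes), Length(Utf8Bytes));
  ViaUtf8String := string(Utf8Str);

  // Both results should be a single UTF-16 character, U+201C.
  Writeln(Length(ViaEncoding), ' ', IntToHex(Ord(ViaEncoding[1]), 4));
  Writeln(ViaEncoding = ViaUtf8String);
end.
```

Either resulting string can then be passed to Memo1.Lines.Add() as usual.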

As for why things work ok when copy/pasting, it is because the text is extracted from the clipboard as UTF-16 when pasted into TMemo. The copier has to choose whether to place text on the clipboard using the ANSI (CF_TEXT) or UTF-16 (CF_UNICODETEXT) format. The clipboard doesn't natively support UTF-8, but the copier can use CF_LOCALE to specify a locale when using CF_TEXT. If the text is not already in UTF-16, the clipboard automatically converts it to UTF-16 when that format is requested.

Best practice is to convert data to/from UTF-16 at the boundaries where the data enters/leaves your app, and then operate only with UTF-16 in memory.
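For text that comes from files, such as content extracted from an ePub, converting at the boundary could look like the sketch below. It assumes the input file is UTF-8; the program name and the use of command-line parameters for the file names are just for illustration.

```delphi
program BoundaryDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

var
  Lines: TStringList;
  I: Integer;
begin
  // Usage (hypothetical): BoundaryDemo <input-utf8-file> <output-file>
  Lines := TStringList.Create;
  try
    // Entering the app: decode UTF-8 from disk into UTF-16 strings.
    Lines.LoadFromFile(ParamStr(1), TEncoding.UTF8);

    // Inside the app: work only with native UTF-16 strings.
    for I := 0 to Lines.Count - 1 do
      Lines[I] := Trim(Lines[I]);

    // Leaving the app: encode back to UTF-8 on the way out.
    Lines.SaveToFile(ParamStr(2), TEncoding.UTF8);
  finally
    Lines.Free;
  end;
end.
```

Once loaded this way, the lines can be added to a TMemo directly, since they are already ordinary UnicodeStrings. Note that SaveToFile() with TEncoding.UTF8 normally writes a byte order mark; as far as I know, the WriteBOM property of TStrings controls that.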
