Read HTML special characters in Delphi string

Question

I have a web page, "index.html", built with Expression Web 4, that contains a value marked by an ID comment:

<html>
<head></head>
<body>
<... some html code ...>
<!--MYVALUEID-->
Dernières News
<... some html code ...>
</body>
</html>

In my Delphi application I load the page into a TStringList and read the value into a TEdit:

S:=TStringList.Create;
S.LoadFromFile('path\index.html');
Edit1.Text:=S[S.IndexOf('<!--MYVALUEID-->')+1];
S.Free;

The problem is the accented character: the TEdit shows "DerniÃ¨res News".

In Expression Web's code view the text is correct: Dernières News

When I open index.html in Notepad, it shows: Dernières News

Notepad reports the file as UTF-8.

When using HTTPApp.HTMLDecode() I get: DerniÃ¨res News

And with System.NetEncoding.TNetEncoding.HTML.Decode I also get: DerniÃ¨res News

Is there a reliable routine to decode HTML special characters?

I checked many questions on SO and tried the solutions mentioned above, but nothing changed.

Thanks in advance, I'm stuck.

You are probably using Delphi 7, so string is ANSI encoded and you take no steps to handle the UTF-8. But that's just a guess. Without details, guessing is all we can do.
– David Heffernan, Commented May 31, 2021 at 5:28
@DavidHeffernan the OP mentions System.NetEncoding.TNetEncoding, which didn't exist until Delphi XE7
– Remy Lebeau, Commented May 31, 2021 at 8:02

Olivier · Accepted Answer · 2021-05-31 07:34:17Z

4

Since your HTML file is encoded in UTF-8, you should specify it when calling LoadFromFile():

S := TStringList.Create;
S.LoadFromFile('path\index.html', TEncoding.UTF8);

Otherwise the ANSI encoding is used.
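
For reference, here is a minimal sketch of the whole read with the encoding specified. The helper name ReadValueAfterMarker is hypothetical (it is not from the question), and the path is the placeholder used in the question. It also guards the lookup: IndexOf() compares whole lines and returns -1 when the marker is not found, in which case the original S[... + 1] expression would silently return the first line of the file.

```delphi
program ReadMarkerValue;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: returns the line that follows Marker,
// or an empty string when the marker is missing or is the last line.
function ReadValueAfterMarker(const FileName, Marker: string): string;
var
  Lines: TStringList;
  Idx: Integer;
begin
  Result := '';
  Lines := TStringList.Create;
  try
    // Decode the file explicitly as UTF-8 instead of the ANSI default.
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    // IndexOf matches an entire line and returns -1 if there is no match.
    Idx := Lines.IndexOf(Marker);
    if (Idx >= 0) and (Idx + 1 < Lines.Count) then
      Result := Lines[Idx + 1];
  finally
    Lines.Free;
  end;
end;

begin
  // Placeholder path from the question; replace with the real file.
  Writeln(ReadValueAfterMarker('path\index.html', '<!--MYVALUEID-->'));
end.
```

In a VCL application the call would be Edit1.Text := ReadValueAfterMarker(...) rather than Writeln.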


5 Comments

Stalkium Over a year ago

I just tried it and got an exception: "No mapping for the Unicode character exists in the target multi-byte code page".

Olivier Over a year ago

@Stalkium What version of Delphi are you using?

Stalkium Over a year ago

I am on RAD Studio 10.3.

Olivier Over a year ago

@Stalkium It looks like your file is not valid UTF-8. Try doing a test with a very simple file.

Stalkium Over a year ago

I think I've got it now. Your solution is the simplest, and it is better practice than decoding and re-encoding.

fpiette · Accepted Answer · 2021-05-31 05:50:19Z

1

You can use UTF8ToWideString to convert a UTF-8 string to a Unicode string:

S := TStringList.Create;
try
    S.LoadFromFile('path\index.html');
    Edit1.Text := UTF8ToWideString(S[S.IndexOf('<!--MYVALUEID-->') + 1]);
finally
    S.Free;
end;


6 Comments

Remy Lebeau Over a year ago

The OP is clearly using a Unicode version of Delphi, so the TStringList will be holding UTF-16 strings, not UTF-8 strings, thus calling UTF8ToWideString() will produce even worse results.

Stalkium Over a year ago

@RemyLebeau Apparently it worked perfectly. I will try it with several other special characters before accepting the answer.

fpiette Over a year ago

Of course, I had verified that my solution works before posting my answer.

Remy Lebeau Over a year ago

@fpiette the only way LoadFromFile() without specifying TEncoding.UTF8 could correctly decode a UTF-8 file containing non-ASCII characters is if the file had a UTF-8 BOM. But there is no way UTF8ToWideString() would work correctly, as it takes a UTF-8 encoded RawByteString as input, but passing it a UnicodeString instead will perform a runtime conversion based on the current DefaultSystemCodePage, which is certainly not going to be set to CP_UTF8 by default.

Remy Lebeau Over a year ago

@fpiette I don't have time or opportunity right now to verify that myself, but I can pretty much guarantee just by looking at it that the result is a fluke, not to be relied on. Without a BOM or TEncoding.UTF8, LoadFromFile() will decode that file to UTF-16 incorrectly, thus data loss occurs before UTF8ToWideString() is even called, and since the input to UTF8ToWideString() is not valid UTF-8, the result can't be depended on. Whether or not it "works" is immaterial. The logic is wrong. What you have shown can only work in pre-Unicode versions of Delphi, not in modern Unicode versions.
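
The mis-decoding described in these comments can be reproduced in isolation. This is only an illustrative sketch, and it assumes the system ANSI code page is Windows-1252 (an assumption, not stated in the thread): it encodes the text as UTF-8 and then decodes the same bytes once as ANSI and once as UTF-8.

```delphi
program MojibakeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Original, Wrong, Right: string;
  Utf8Bytes: TBytes;
  Ansi: TEncoding;
begin
  // #$00E8 is 'è', written as a code point so that the encoding
  // of this source file does not affect the demo.
  Original := 'Derni' + #$00E8 + 'res News';

  // In UTF-8, 'è' is stored as the two bytes C3 A8.
  Utf8Bytes := TEncoding.UTF8.GetBytes(Original);

  // Assumption: ANSI means Windows-1252 here.
  // Encodings obtained from GetEncoding must be freed by the caller.
  Ansi := TEncoding.GetEncoding(1252);
  try
    // Each byte becomes its own character: C3 -> 'Ã', A8 -> '¨'.
    Wrong := Ansi.GetString(Utf8Bytes);
  finally
    Ansi.Free;
  end;

  // Decoding with the matching encoding restores the original text.
  Right := TEncoding.UTF8.GetString(Utf8Bytes);

  // Wrong is one character longer than Original; Right equals Original.
  Writeln(Length(Original), ' ', Length(Wrong));
  Writeln(Right = Original);
end.
```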


Stalkium · Accepted Answer · 2021-05-31 11:31:29Z

I think I found the problem, but the solution leads to another one. The file I was actually reading is "header.html" (I wrote index.html just for the example). header.html is included into the index file by PHP, so it contains no head or body section, to avoid messing up the HTML when it is included. For that reason Expression Web did not save it as UTF-8. When I added a UTF-8 meta tag to the file to tell the editor to encode it as UTF-8, it worked.

But now I have a new problem. When I add:

<head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"></head>

the editor shows a dialog asking to remove the BOM from header.html, to avoid a blank space being displayed in the browser (which does happen). So if I remove the BOM the document loses its UTF-8 encoding, and if I keep it a blank space is displayed in the browser.

I know this should be a separate question, so I will remove the BOM and use fpiette's solution to read the data.
