0

I have a web page, "index.html", built with Expression Web 4, that contains a value marked by an ID comment:

<html>
<head></head>
<body>
<... some html code ...>
<!--MYVALUEID-->
Dernières News
<... some html code ...>
</body>
</html>

In my Delphi application, I load the page into a TStringList and copy the value into a TEdit:

S:=TStringList.Create;
S.LoadFromFile('path\index.html');
Edit1.Text:=S[S.IndexOf('<!--MYVALUEID-->')+1];
S.Free;

The problem is the accented character. This is what I get in the TEdit: "Dernières News"

In the Expression Web code view the text is correct: Dernières News

When I open index.html in Notepad, it shows: Dernières News

Notepad reports the file as UTF-8.

When using HTTPApp.HTMLDecode(), I get: Dernières News

And with System.NetEncoding.TNetEncoding.HTML.Decode(), also: Dernières News

Is there a reliable routine to decode HTML special characters?

I checked many questions on SO and tried the solutions mentioned above, but nothing changes.

Thanks in advance, I'm stuck.

  • You are probably using Delphi 7, so string is ANSI-encoded and you take no steps to handle the UTF-8. But that's just a guess. Without details, guessing is all we can do. Commented May 31, 2021 at 5:28
  • @DavidHeffernan The OP mentions System.NetEncoding.TNetEncoding, which didn't exist until Delphi XE7. Commented May 31, 2021 at 8:02

3 Answers

4

Since your HTML file is encoded in UTF-8, you should specify it when calling LoadFromFile():

S := TStringList.Create;
S.LoadFromFile('path\index.html', TEncoding.UTF8);

Otherwise, unless the file starts with a BOM, the system's default ANSI encoding is used.
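To put the call above in context, here is a minimal sketch of the whole read, assuming a Unicode version of Delphi (the OP mentions 10.3). The helper name ReadLineAfterMarker and the command-line handling are made up for illustration; they are not part of the question. Unlike the original snippet, it checks the result of IndexOf(), which returns -1 when the marker line is not found.

```pascal
program ReadMarkedLine;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: returns the line that follows Marker, or '' when the
// marker is missing or is the last line of the file.
function ReadLineAfterMarker(const FileName, Marker: string): string;
var
  Lines: TStringList;
  Idx: Integer;
begin
  Result := '';
  Lines := TStringList.Create;
  try
    // Decode the file explicitly as UTF-8 instead of the ANSI default.
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    // IndexOf compares whole lines, so the marker must be alone on its line.
    Idx := Lines.IndexOf(Marker);
    if (Idx >= 0) and (Idx < Lines.Count - 1) then
      Result := Lines[Idx + 1];
  finally
    Lines.Free;
  end;
end;

begin
  // Assumption: the HTML file path is passed as the first argument.
  if ParamCount < 1 then
    Writeln('Usage: ReadMarkedLine <file>')
  else
    Writeln(ReadLineAfterMarker(ParamStr(1), '<!--MYVALUEID-->'));
end.
```

In a VCL form the call would simply be Edit1.Text := ReadLineAfterMarker('path\index.html', '<!--MYVALUEID-->'). If the file is not valid UTF-8, LoadFromFile() is expected to raise an encoding exception like the one reported in the comments.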


5 Comments

I just tried and got an exception "No mapping for the Unicode character exists in the target multi-byte code page" !!!
@Stalkium What version of Delphi are you using?
I am on RAD Studio 10.3.
@Stalkium It looks like your file is not valid UTF-8. Try doing a test with a very simple file.
I think I've got it now. Your solution is the simplest, and it is better practice than decoding and re-encoding.
1

You can use UTF8ToWideString to convert a UTF-8 string to a Unicode string:

S := TStringList.Create;
try
    S.LoadFromFile('path\index.html');
    Edit1.Text := UTF8ToWideString(S[S.IndexOf('<!--MYVALUEID-->') + 1]);
finally
    S.Free;
end;

6 Comments

The OP is clearly using a Unicode version of Delphi, so the TStringList will be holding UTF-16 strings, not UTF-8 strings, thus calling UTF8ToWideString() will produce even worse results.
@RemyLebeau Apparently it worked perfectly. I will try several other special characters before accepting the answer.
Of course, I had verified that my solution works before posting my answer.
@fpiette the only way LoadFromFile() without specifying TEncoding.UTF8 could correctly decode a UTF-8 file containing non-ASCII characters is if the file had a UTF-8 BOM. But there is no way UTF8ToWideString() would work correctly, as it takes a UTF-8 encoded RawByteString as input, but passing it a UnicodeString instead will perform a runtime conversion based on the current DefaultSystemCodePage, which is certainly not going to be set to CP_UTF8 by default.
@fpiette I don't have time or opportunity right now to verify that myself, but I can pretty much guarantee just by looking at it that the result is a fluke, not to be relied on. Without a BOM or TEncoding.UTF8, LoadFromFile() will decode that file to UTF-16 incorrectly, thus data loss occurs before UTF8ToWideString() is even called, and since the input to UTF8ToWideString() is not valid UTF-8, the result can't be depended on. Whether or not it "works" is immaterial. The logic is wrong. What you have shown can only work in pre-Unicode versions of Delphi, not in modern Unicode versions.
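The point made in the comments above is that UTF-8 decoding has to be applied to the file's raw bytes, not to a string that has already been decoded with the wrong code page. A minimal sketch of that idea follows, assuming a Unicode Delphi; the helper name LoadUtf8File is hypothetical. It reads the bytes, skips a UTF-8 BOM if one is present, and decodes the rest in a single step.

```pascal
program DecodeUtf8Bytes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: decodes the raw bytes of a file as UTF-8,
// skipping the 3-byte UTF-8 BOM (EF BB BF) when the file starts with one.
function LoadUtf8File(const FileName: string): string;
var
  Bytes: TBytes;
  Offset: Integer;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  Offset := 0;
  if (Length(Bytes) >= 3) and (Bytes[0] = $EF) and (Bytes[1] = $BB) and
    (Bytes[2] = $BF) then
    Offset := 3;
  // The bytes go straight from UTF-8 to a UTF-16 string; no ANSI
  // conversion happens in between.
  Result := TEncoding.UTF8.GetString(Bytes, Offset, Length(Bytes) - Offset);
end;

begin
  // Assumption: the file path is passed as the first argument.
  if ParamCount >= 1 then
    Writeln(Length(LoadUtf8File(ParamStr(1))), ' UTF-16 code units decoded');
end.
```

This is roughly what LoadFromFile() with TEncoding.UTF8 does for you, so in practice the one-line fix from the other answer should be enough; the sketch only makes the byte-level step visible.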
-2

I think I found the problem, but the solution leads to another one. The file I was actually trying to read is "header.html" (I wrote index.html just for the example). header.html is included into the index file by PHP, so it contains no head or body section, to avoid messing up the HTML when it is included. For that reason Expression Web did not encode it as UTF-8. When I added a UTF-8 meta tag to the file to tell the editor to encode it, it worked.

But my problem now is that when I add this:

<head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"></head>

the editor shows a dialog asking to remove the BOM from header.html, to avoid a blank space being displayed in the browser (which is true). If I remove it, the document loses its UTF-8 encoding, and if I keep it, a blank space is displayed in the browser.

I know this should be a separate question, so I will remove the BOM and use fpiette's solution to read the data.
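For what it's worth, a file can stay UTF-8 without a BOM; the BOM is only a hint for programs that guess the encoding. As a rough sketch, assuming a Delphi version whose TStrings has the WriteBOM property (recent Unicode versions do) and a file that already contains valid UTF-8, the BOM could be stripped like this. The procedure name SaveUtf8WithoutBom is made up for illustration.

```pascal
program StripUtf8Bom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: rewrites a UTF-8 file in place without a BOM.
// Note that saving through TStringList may normalize the line endings.
procedure SaveUtf8WithoutBom(const FileName: string);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Works whether or not the file currently starts with a BOM.
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    // Keep the UTF-8 encoding but do not write the preamble bytes.
    Lines.WriteBOM := False;
    Lines.SaveToFile(FileName, TEncoding.UTF8);
  finally
    Lines.Free;
  end;
end;

begin
  // Assumption: the file path is passed as the first argument.
  if ParamCount >= 1 then
    SaveUtf8WithoutBom(ParamStr(1));
end.
```

A file saved this way has no BOM for the browser to render, and it can still be read back by passing TEncoding.UTF8 to LoadFromFile().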
