There are some posts regarding encoding questions and HtmlAgilityPack but this issue wasn't addressed:

Because the website I am trying to parse contains non-ASCII characters such as ä and ü, I tried to set the encoding to Unicode:

using System.Text;
using HtmlAgilityPack;

public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        // Encoding.Unicode is UTF-16 (little-endian) in .NET
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further parsing fails: the text decoded as Unicode is not valid HTML (it looks more like Chinese)
    }
}

But now

htmlDoc.DocumentNode.InnerHtml

looks like this:

ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...

If I use UTF-8 or ISO-8859-1 instead, these characters (including ä, ö, ü) are not decoded correctly. How can I fix this?

  • I updated my code example - I hope it contains everything you need. If not - please don't hesitate to ask for more information. Commented Dec 8, 2018 at 21:57
  • I tried to reproduce the problem and LinqPad is freezing... Commented Dec 8, 2018 at 22:11
  • @FalcoAlexander Then use other tools. Why do you think we should be interested in the tools you are using? Commented Dec 8, 2018 at 22:13
  • @FalcoAlexander Yes, I use Visual Studio and write code to test the questions :) Commented Dec 8, 2018 at 22:22
  • Also tested with RoslynPad; I can reproduce the strange behaviour with a huge "Chinese" response that freezes the .NET host. Commented Dec 8, 2018 at 22:25

2 Answers

Your site is misconfigured: the page declares ISO-8859-1, but the real encoding is cp1252 (Windows-1252).

The code below should work:

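// Requires: using System.Net.Http; using System.Text;
var client = new HttpClient();
// Download the raw bytes so that no decoding has been applied yet.
var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
// Decode the bytes explicitly as Windows-1252.
var html = Encoding.GetEncoding(1252).GetString(buf);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);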
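
To illustrate why cp1252 and ISO-8859-1 give different results, here is a minimal sketch that decodes the same bytes both ways. The sample bytes and the names `Cp1252Demo`, `DecodeCp1252` and `DecodeLatin1` are made up for this example. It assumes .NET Core or later, where code page 1252 is only available after registering `CodePagesEncodingProvider` (on .NET Core 2.x this additionally needs the System.Text.Encoding.CodePages NuGet package).

```csharp
using System;
using System.Linq;
using System.Text;

public static class Cp1252Demo
{
    // Decodes raw bytes as Windows-1252. On .NET Core the code-page encodings
    // are not registered by default, so the provider is registered first.
    public static string DecodeCp1252(byte[] bytes)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    // Decodes the same bytes as ISO-8859-1 for comparison.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Renders a string as hexadecimal code points so the console encoding does not matter.
    public static string ToCodePoints(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Hypothetical sample: 0x80 is the euro sign in cp1252, 0xE4 is ä, 0xFC is ü.
        byte[] sample = { 0x80, 0xE4, 0xFC };
        Console.WriteLine(ToCodePoints(DecodeCp1252(sample))); // prints 20AC 00E4 00FC
        Console.WriteLine(ToCodePoints(DecodeLatin1(sample))); // prints 0080 00E4 00FC
    }
}
```

The umlauts come out the same in both encodings. Only bytes in the range 0x80 to 0x9F differ: cp1252 maps them to printable characters, while ISO-8859-1 maps them to invisible control characters, which would be consistent with symbols appearing to vanish.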

1 Comment

Thanks for your reply. Encoding.GetEncoding(1252) gives me a System.NotSupportedException. Do I have to configure something to get this encoding? I am using .NET Core 2.1 and Windows 10 64-bit. Edit: This fixed it: stackoverflow.com/questions/37870084/… Thanks a lot!

Instead of Encoding.Unicode, use:

web.OverrideEncoding = Encoding.GetEncoding("iso-8859-1");

(Tested with your website and German umlauts.)
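
As a rough illustration of why Encoding.Unicode produces the "Chinese" output: in .NET, Encoding.Unicode means UTF-16 (little-endian), so every two single-byte characters of the page are fused into one 16-bit code unit. The sketch below reproduces this with a made-up helper, `MisreadAsUtf16`, on the start of a DOCTYPE declaration.

```csharp
using System;
using System.Text;

public static class Utf16MisreadDemo
{
    // Encodes the text as single bytes, then decodes those bytes as UTF-16LE,
    // which is what Encoding.Unicode stands for in .NET.
    public static string MisreadAsUtf16(string asciiText)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(asciiText);
        return Encoding.Unicode.GetString(bytes);
    }

    public static void Main()
    {
        string garbled = MisreadAsUtf16("<!DOCTYPE html");
        // Print code points instead of glyphs so the console font does not matter.
        foreach (char c in garbled)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

The bytes of "<!" (0x3C, 0x21) become U+213C and those of "DO" (0x44, 0x4F) become U+4F44, which are the first two characters of the garbled output shown in the question.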

To find the right encoding, check the head of the target page. It contains the hint:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
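
A minimal sketch of reading that hint programmatically, assuming the HTML is already available as a string. `CharsetSniffer` and `FindDeclaredCharset` are hypothetical names, and the regular expression is a simplification, not a full HTML parser. Keep in mind that the declared charset is only a hint: browsers, following the WHATWG Encoding Standard, treat the label ISO-8859-1 as Windows-1252, so a page can declare one and effectively use the other.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Returns the charset named in a <meta> tag, or null if none is found.
    public static string FindDeclaredCharset(string html)
    {
        Match m = Regex.Match(
            html,
            @"<meta[^>]*charset\s*=\s*[""']?([\w-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string head = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">";
        Console.WriteLine(FindDeclaredCharset(head)); // prints ISO-8859-1
    }
}
```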

1 Comment

This didn't work for me: the symbol and the umlauts (ä, ü) are simply removed.
