C# and HtmlAgilityPack encoding problem

Question

WebClient GodLikeClient = new WebClient();
HtmlAgilityPack.HtmlDocument GodLikeHTML = new HtmlAgilityPack.HtmlDocument();

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"));

So this code returns: "Skaitytojo klausimas psichologui: kas lemia homoseksualumÄ…? - NaujienÅ³ portalas Alfa.lt" instead of "Skaitytojo klausimas psichologui: kas lemia homoseksualumą? - Naujienų portalas Alfa.lt".

This webpage is encoded in Windows-1257 (Baltic), but textBox1.Text = GodLikeHTML.DocumentNode.OuterHtml; returns distorted text: Baltic diacritics are turned into strange multi-character sequences :(

And yes, I've tried the HtmlAgilityPack forums. They do suck.

P.S. I'm no programmer, but I work on a community project and I really need to get this code working. Thanks ;}

Community · Accepted Answer · 2017-05-23 12:25:34Z

26

Actually the page is encoded with UTF-8.

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.UTF8);

will work.
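To see why telling the parser the page is UTF-8 fixes the output, here is a minimal sketch that reproduces the garbling from the question. It assumes ISO-8859-1 as a stand-in for the wrong single-byte code page (the actual default on the asker's machine is not stated), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with a different encoding.
    // This mimics reading a UTF-8 page with the wrong code page.
    public static string Misdecode(string text, Encoding wrong)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrong.GetString(utf8Bytes);
    }

    // Reverses the damage: recover the original bytes, then decode them as UTF-8.
    // This only works when the wrong encoding maps every byte to a character losslessly.
    public static string Repair(string garbled, Encoding wrong)
    {
        return Encoding.UTF8.GetString(wrong.GetBytes(garbled));
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the wrong single-byte code page.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        string original = "Naujien\u0173";            // "Naujienų"
        string garbled = Misdecode(original, latin1); // one letter becomes two characters

        Console.WriteLine(garbled == "Naujien\u00C5\u00B3");      // True ("NaujienÅ³")
        Console.WriteLine(Repair(garbled, latin1) == original);   // True
    }
}
```

Each Baltic letter takes two bytes in UTF-8, and a single-byte code page shows each byte as its own character, which is where the "several characters long" strings come from.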

Or you could use the code in my SO answer, which detects the encoding from the HTTP headers or meta tags and re-encodes properly. (It also supports gzip to minimize your download.)

With the download class your code would look like:

HttpDownloader downloader = new HttpDownloader("http://www.alfa.lt",null,null);
GodLikeHTML.LoadHtml(downloader.GetPage());
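The HttpDownloader class itself is not shown in this thread, so the following is not its code. It is only a rough sketch of the "detect from HTTP headers" half of the idea, using System.Net.Mime.ContentType to parse a Content-Type header value. The helper name CharsetSniffer.FromContentType and the fallback parameter are hypothetical, and meta tag detection is left out.

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class CharsetSniffer
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or the fallback when the header has no usable charset.
    public static Encoding FromContentType(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return fallback;

        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // header value could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // charset name is not known to this runtime
        }
    }

    static void Main()
    {
        Encoding detected = FromContentType("text/html; charset=utf-8", Encoding.UTF8);
        Console.WriteLine(detected.WebName);
    }
}
```

With a real response you would pass response.ContentType from HttpWebResponse into the helper and hand the result to HtmlDocument.Load(stream, encoding).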

edited May 23, 2017 at 12:25

CommunityBot

answered Aug 10, 2010 at 19:32

Mikael Svenson

7 Comments

August Over a year ago

Yup, that did the job :D Wow, quite simple, isn't it. Thanks!

Andreas Reiff Over a year ago

many thanks, got crazy characters like á or whatever, now it is working fine

Mikael Svenson Over a year ago

It's a variable from the original question

Mikael Svenson Over a year ago

@PierreLebon I suggest you take a look at the available properties of the HttpWebRequest class and you will be amazed :)

Drag and Drop Over a year ago

I made an overload of the constructor and added it to the get-page request. Now future readers will have a solid hint on how to do it, because when fighting with encoding you can miss the simple things.

craastad · Accepted Answer · 2012-10-22 15:01:36Z

17

I had a similar encoding problem. I fixed it, in the most current version of HtmlAgilityPack, by setting OverrideEncoding on HtmlWeb instead of going through WebClient.

var htmlWeb = new HtmlWeb();
htmlWeb.OverrideEncoding = Encoding.UTF8;
var doc = htmlWeb.Load("http://www.alfa.lt");

answered Oct 22, 2012 at 15:01

craastad

1 Comment

a1204773 Over a year ago

Best answer (why use WebClient when you can do it using only HtmlAgilityPack?)

Irvin Dominin · Accepted Answer · 2021-01-10 14:48:16Z

7

UTF8 didn't work for me, but after setting the encoding like this, most pages I was trying to scrape worked just fine:

web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1");

Perhaps it might help someone.

edited Jan 10, 2021 at 14:48

Irvin Dominin

answered Jun 12, 2013 at 9:40

Tys

1 Comment

John Foll Over a year ago

Thanks! It was weird, I had been debugging my program, which does HtmlWeb web = new HtmlWeb(); then doc = web.Load(nextPageUrl); and it had stopped working. I had been testing for several days. Why would it stop working? I had a bug that threw a custom exception. But ever after that, even after restarting my app from the debugger several times, it was giving me that weird error. Your solution fixed mine. I was looking for a way, but didn't see it.

Sagiv Ofek · Accepted Answer · 2011-10-02 18:45:47Z

5

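HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Put the page's actual encoding here; Encoding.Default is the system ANSI code page on .NET Framework.
using (var response = WebRequest.Create(YourUrl).GetResponse())
using (var reader = new StreamReader(response.GetResponseStream(), Encoding.Default))
{
    doc.Load(reader);
}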

hope it helps :)

answered Oct 2, 2011 at 18:45

Sagiv Ofek

Ebleme · Accepted Answer · 2016-02-07 15:12:39Z

2

If none of those posts work, just use this: WebUtility.HtmlDecode("Your html text");
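One thing worth keeping in mind: WebUtility.HtmlDecode turns HTML entities such as &#261; back into characters, but it does not repair text that was already decoded with the wrong byte encoding. A small sketch to illustrate the difference (the class name and the sample strings are made up for illustration):

```csharp
using System;
using System.Net;

static class EntityDemo
{
    // Thin wrapper so the behaviour is easy to try out.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    static void Main()
    {
        // Numeric character references are converted to the real letters.
        string fromEntities = Decode("homoseksualum&#261;");
        Console.WriteLine(fromEntities == "homoseksualum\u0105");    // True

        // Text garbled by a wrong byte encoding contains no entities,
        // so it comes back unchanged.
        string garbled = "Naujien\u00C5\u00B3";
        Console.WriteLine(Decode(garbled) == garbled);               // True
    }
}
```

So this helps when the page writes its diacritics as entities, and the encoding answers above are still needed when the bytes themselves were misread.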

answered Feb 7, 2016 at 15:12

Ebleme

Ilia G · Accepted Answer · 2010-08-10 19:07:44Z

1

Try changing that to GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.GetEncoding(1257));
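For anyone trying this on .NET Core or .NET 5+ rather than .NET Framework: legacy Windows code pages such as 1257 are not available there until the code pages provider is registered. A minimal sketch, assuming a runtime where CodePagesEncodingProvider is available (on older targets it comes from the System.Text.Encoding.CodePages package); the helper name is hypothetical:

```csharp
using System;
using System.Text;

static class BalticCodePage
{
    // Registers the legacy code page provider and returns Windows-1257.
    // Assumption: running on .NET Core / .NET 5+; on .NET Framework,
    // Encoding.GetEncoding(1257) works without this registration.
    public static Encoding GetWindows1257()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1257);
    }

    static void Main()
    {
        Encoding baltic = GetWindows1257();

        // A Baltic letter fits in one byte and survives a round trip.
        byte[] bytes = baltic.GetBytes("\u0105"); // "ą"
        Console.WriteLine(bytes.Length);
        Console.WriteLine(baltic.GetString(bytes) == "\u0105");
    }
}
```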

answered Aug 10, 2010 at 19:07

Ilia G

1 Comment

August Over a year ago

Sorry, I misled you: the page was actually encoded in UTF-8. Thanks for your help though.

T-CROC · Accepted Answer · 2020-04-01 15:00:00Z

1

This seemed to remove the need to know anything about encoding for me:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        Console.Write("Enter the url to pull html documents from: ");
        string url = Console.ReadLine();

        HtmlDocument document = new HtmlDocument();

        var request = WebRequest.Create(url);

        // StreamReader without an explicit encoding decodes as UTF-8
        // (after checking for a byte order mark), so this works for UTF-8 pages.
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            document.LoadHtml(reader.ReadToEnd());
        }
    }
}

answered Apr 1, 2020 at 15:00

T-CROC

eliprodigy · Accepted Answer · 2015-04-07 09:55:48Z

0

This is my solution

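// Requires: using System.IO; using System.Net; using System.Text; using HtmlAgilityPack;
var doc = new HtmlDocument();
var request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
byte[] barr;
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
using (var buffer = new MemoryStream())
{
    // A single Stream.Read call may return fewer bytes than requested,
    // and ContentLength can be -1, so copy until the end of the stream.
    stream.CopyTo(buffer);
    barr = buffer.ToArray();
}
// Decode provisionally as UTF-8 so the charset declared in the markup can be read.
string data = Encoding.UTF8.GetString(barr);
// Fall back to UTF-8 in case no encoding is detected (assumed to come back as null).
Encoding encod = doc.DetectEncodingHtml(data) ?? Encoding.UTF8;
// Decode the raw bytes again with the detected encoding.
string convstr = encod.GetString(barr);
doc.LoadHtml(convstr);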

answered Apr 7, 2015 at 9:55

eliprodigy

Stanislav Koncebovski · Accepted Answer · 2022-11-11 11:25:18Z

0

Even simpler (WebClient seems not to have any OverrideEncoding feature):

using (WebClient webClient = new WebClient())
{
    webClient.Encoding = Encoding.UTF8;
    // do whatever you want...
}

(works for me in .NET Framework 4.8)

answered Nov 11, 2022 at 11:25

Stanislav Koncebovski
