26
WebClient GodLikeClient = new WebClient();
HtmlAgilityPack.HtmlDocument GodLikeHTML = new HtmlAgilityPack.HtmlDocument();

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"));

So this code returns: "Skaitytojo klausimas psichologui: kas lemia homoseksualumÄ…? - Naujienų portalas Alfa.lt" instead of "Skaitytojo klausimas psichologui: kas lemia homoseksualumą? - Naujienų portalas Alfa.lt".

This web page is encoded in Windows-1257 (Baltic), but textBox1.Text = GodLikeHTML.DocumentNode.OuterHtml; returns distorted text: the Baltic diacritics are turned into strange multi-character sequences :(

And yes, I've tried the HtmlAgilityPack forums. They do suck.

P.S. I'm no programmer, but I work on a community project and I really need to get this code working. Thanks ;}

9 Answers

26

Actually the page is encoded with UTF-8.

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.UTF8);

will work.
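To see why the title came out garbled, here is a minimal sketch that reproduces the effect by decoding UTF-8 bytes with the wrong code page. The class and method names (MojibakeDemo, DecodeUtf8As) are made up for illustration, and the RegisterProvider call assumes a runtime where CodePagesEncodingProvider is available (.NET Framework 4.6+ or .NET Core/.NET 5+).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encodes `text` as UTF-8, then decodes those bytes
    // with a different code page to reproduce the garbling.
    public static string DecodeUtf8As(string text, int wrongCodePage)
    {
        // On .NET Core / .NET 5+ legacy code pages such as 1257 are not
        // registered by default, so register the provider first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(wrongCodePage).GetString(utf8Bytes);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // "ą" is two bytes in UTF-8 (0xC4 0x85); Windows-1257 reads them as two characters.
        Console.WriteLine(DecodeUtf8As("homoseksualumą", 1257));
    }
}
```

The single letter "ą" turns into two characters, which matches the distorted title in the question and suggests the page bytes were UTF-8 all along.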

Or you could use the code in my SO answer, which detects the encoding from the HTTP headers or meta tags and re-encodes properly. (It also supports gzip to minimize your download.)

With the download class your code would look like:

HttpDownloader downloader = new HttpDownloader("http://www.alfa.lt",null,null);
GodLikeHTML.LoadHtml(downloader.GetPage());
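The HttpDownloader class itself is not shown in this thread. As a rough idea of the header-based half of that approach, here is a sketch that reads the charset from the Content-Type response header and falls back to UTF-8. It is not the code from the linked answer: CharsetSketch, FromCharset and DownloadAsync are hypothetical names, and meta-tag detection and gzip are left out.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class CharsetSketch
{
    // Maps a Content-Type charset value to an Encoding, falling back to UTF-8
    // when the value is missing or not recognized.
    public static Encoding FromCharset(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset)) return Encoding.UTF8;
        try
        {
            // Some servers wrap the value in quotes, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads the raw bytes and decodes them with the declared charset.
    public static async Task<string> DownloadAsync(string url)
    {
        using (var client = new HttpClient())
        using (var response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            return FromCharset(charset).GetString(body);
        }
    }
}
```

The resulting string could then be passed to LoadHtml in the same way as downloader.GetPage() above.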

7 Comments

Yup, that did the job :D Wow, quite simple, isn't it. Thanks!
many thanks, got crazy characters like á or whatever, now it is working fine
It's a variable from the original question
@PierreLebon I suggest you take a look at the available properties of the HttpWebRequest class and you will be amazed :)
I made an overload of the constructor and added it to the GetPage request. But now future readers will have a solid hint on how to do it, because when fighting with encodings you can miss simple things.
17

I had similar encoding problems. I fixed them, in the most current version of HtmlAgilityPack, by adding the following to my HtmlWeb initialization.

var htmlWeb = new HtmlWeb();
htmlWeb.OverrideEncoding = Encoding.UTF8;
var doc = htmlWeb.Load("http://www.alfa.lt");

1 Comment

Best answer (why use WebClient when you can do it using only HtmlAgilityPack?)
7

UTF8 didn't work for me, but after setting the encoding like this, most pages I was trying to scrape worked just fine:

web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1");

Perhaps it might help someone.

1 Comment

Thanks! It was weird. I had been debugging my program, doing HtmlWeb web = new HtmlWeb(); then doc = web.Load(nextPageUrl);, and it had stopped working. I had been testing for several days. Why would it stop working? I had a bug that threw a custom exception, but ever after that, even after restarting my app from the debugger several times, it was giving me that weird error. Your solution fixed mine. I was looking for a way, but didn't see it.
5
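HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Pass the page's actual encoding here; Encoding.Default is only a placeholder.
using (WebResponse response = WebRequest.Create(YourUrl).GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.Default))
{
    doc.Load(reader);
}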

hope it helps :)


2

If none of those posts work, just use this: WebUtility.HtmlDecode("Your html text");
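A small sketch of what WebUtility.HtmlDecode does may help decide whether it applies: it turns HTML entities back into characters, and it leaves text that was already decoded with the wrong byte encoding as it is. EntityDemo is just an illustrative name, and the sample strings are made up from the title in the question.

```csharp
using System;
using System.Net;
using System.Text;

static class EntityDemo
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // A numeric character reference is decoded: &#261; is the letter "ą".
        Console.WriteLine(WebUtility.HtmlDecode("kas lemia homoseksualum&#261;?"));

        // Text garbled by a wrong byte decoding contains no entities,
        // so HtmlDecode returns it unchanged.
        Console.WriteLine(WebUtility.HtmlDecode("homoseksualumÄ…"));
    }
}
```

So this is likely to help when the page writes its diacritics as entities, and not when the bytes were read with the wrong encoding.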


1

Try changing that to GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.GetEncoding(1257));

1 Comment

Sorry, I misled you: it was encoded in UTF-8. Thanks for your help though.
1

This seemed to remove the need to know anything about encoding for me:

using System;
using HtmlAgilityPack;
using System.Net;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        Console.Write("Enter the url to pull html documents from: ");

        string url = Console.ReadLine();

        HtmlDocument document = new HtmlDocument();

        var request = WebRequest.Create(url);

        // Dispose the response as well as the reader once the page has been read.
        using (var response = request.GetResponse())
        // With no encoding argument, StreamReader decodes as UTF-8 unless it finds a byte order mark.
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            document.LoadHtml(reader.ReadToEnd());
        }
    }
}
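One caveat worth checking: a StreamReader created without an encoding argument assumes UTF-8 (unless it detects a byte order mark), so this approach fits pages that really are UTF-8. The sketch below shows that behaviour on in-memory bytes; ReaderDefaultDemo and ReadWithDefault are hypothetical names used only here.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderDefaultDemo
{
    // Reads the bytes the same way the answer does: StreamReader with no encoding argument.
    public static string ReadWithDefault(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // UTF-8 bytes round-trip correctly.
        string word = "homoseksualumą";
        Console.WriteLine(ReadWithDefault(Encoding.UTF8.GetBytes(word)) == word);   // True

        // ISO-8859-1 bytes do not: the byte for "é" is not valid UTF-8 on its own
        // and is replaced with U+FFFD.
        byte[] latin1 = Encoding.GetEncoding("ISO-8859-1").GetBytes("café");
        Console.WriteLine(ReadWithDefault(latin1) == "café");                       // False
    }
}
```

For pages in another encoding, passing that encoding to the StreamReader constructor is still needed.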


0

This is my solution

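HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
byte[] barr;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (MemoryStream buffer = new MemoryStream())
{
    // ContentLength can be -1 and a single Read may return fewer bytes than requested,
    // so copy the whole stream instead.
    stream.CopyTo(buffer);
    barr = buffer.ToArray();
}
// Decode provisionally as UTF-8 so the declared encoding in the markup can be inspected.
string data = Encoding.UTF8.GetString(barr);
// Assumption: fall back to UTF-8 if no encoding is detected (DetectEncodingHtml may return null).
Encoding encod = doc.DetectEncodingHtml(data) ?? Encoding.UTF8;
// Decode the original bytes with the detected encoding.
string convstr = encod.GetString(barr);
doc.LoadHtml(convstr);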


0

Even simpler (WebClient seems not to have any OverrideEncoding feature):

using (WebClient webClient = new WebClient())
{
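    // Encoding is applied by string methods such as DownloadString, not by the raw stream from OpenRead.
    webClient.Encoding = Encoding.UTF8;
    // do whatever you want...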
}

(works for me in .NET Framework 4.8)

