I have a question about Chinese encoding and saving back to a file. I am using HtmlAgilityPack to parse HTML, modify it, and save it back to the same file. I am having a problem with encodings such as Chinese (GB2312, Simplified). When I open the file, I read the encoding, and then I save the document back with HtmlAgilityPack using that encoding:

doc.Save(this._filePath, reader.CurrentEncoding);

but the Chinese characters come out completely garbled. How can I save back to the same file and keep its current encoding? I also tried getting the encoding from HtmlAgilityPack like this:

FileStream fs = new FileStream(this._filePath, FileMode.Open);

StreamReader reader = new StreamReader(fs);

HtmlDocument doc = new HtmlDocument();
doc.Load(reader);

Encoding enc = doc.DeclaredEncoding;

fs.Close();

doc.Save(this._filePath, enc);

but that didn't work either. Any ideas?

  • HTML encoding can be determined in many ways (HTTP headers, META tag, byte encoding, BOM, etc.). DeclaredEncoding is the one found in the META tag. Are you sure this file declares a META tag? Otherwise, can you give the URL of a sample file that exhibits this behavior? Commented Mar 18, 2011 at 20:54
  • Simon, you clued me in on something. You are very correct, DeclaredEncoding does pull the data out of the Meta Tag. So I began some investigative work, and I noticed that the meta tag is badly formed. So Agility Pack doesn't want to pick it up. I'll have to do some RegEx to pull out the Encoding. Thanks for the tip! Commented Mar 19, 2011 at 1:45
  • OK, I figured it out. It took some doing, and it turned out to be a combination of several things. Thanks for the clue; that's what triggered a whole bunch of thoughts. Commented Mar 19, 2011 at 3:24
  • you should answer yourself so the question is marked answered - it's the "self learning" badge thing :-) Commented Mar 21, 2011 at 8:20
  • @Simon Mourier, I did as you suggested. It's not really a concrete answer, but I figured maybe it will clue someone onto something. Commented Mar 27, 2011 at 16:10

1 Answer

So after some work, I managed to get it to work by reading the declared encoding out of the meta tag. I thought the tag was badly formed initially, but it was actually correct, and DeclaredEncoding did read the encoding from it.

When the file was saved, it was still written in ANSI format, and I couldn't find a way to change that. However, the encoding in the meta tag did seem to make the page render correctly in the browser. Hope that helps someone.
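For anyone hitting the same thing, here is a minimal sketch of one way to keep the encoding consistent on both load and save. It rests on an assumption about the cause: `new StreamReader(fs)` decodes as UTF-8 unless the file has a byte order mark, so GB2312 bytes may already be garbled before the document is parsed, and `reader.CurrentEncoding` then reports UTF-8 rather than the file's real encoding. The class `EncodingPreservingSave`, the helper `SniffDeclaredEncoding`, and the UTF-8 fallback are hypothetical names and choices made for illustration; they are not part of HtmlAgilityPack.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class EncodingPreservingSave
{
    // Hypothetical helper: looks for "charset=..." in the raw bytes of the file.
    public static Encoding SniffDeclaredEncoding(string path, Encoding fallback)
    {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII-only
        // meta tag is readable regardless of the file's real encoding.
        string raw = Encoding.GetEncoding("iso-8859-1").GetString(File.ReadAllBytes(path));
        Match m = Regex.Match(raw, @"charset\s*=\s*[""']?\s*([\w-]+)", RegexOptions.IgnoreCase);
        if (!m.Success) return fallback;

        try { return Encoding.GetEncoding(m.Groups[1].Value); }
        catch (ArgumentException) { return fallback; } // unknown or unregistered charset name
    }

    public static void RoundTrip(string path)
    {
        // On .NET Core / .NET 5+, code-page encodings such as GB2312 need to be
        // registered first, otherwise the lookup above falls back silently:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding enc = SniffDeclaredEncoding(path, Encoding.UTF8); // fallback is an assumption

        var doc = new HtmlDocument();
        doc.Load(path, enc);   // decode with the file's own encoding

        // ... modify the document here ...

        doc.Save(path, enc);   // write back with the same encoding
    }
}
```

The point of the sketch is that the same `Encoding` instance is passed to both `Load` and `Save`, and that it is obtained from the file's bytes rather than from a reader that has already picked a default. If the file has no charset declaration, the fallback decides the result, so it should be set to whatever the files are known to use.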
