1

I am trying to get the page title from the page source of different pages. But let's say some pages have a title like this:

&quot;This is an example,&quot; ABC.

It has HTML entities in it, like "&quot;". If I read this title into a string in C#, I get the whole thing verbatim, and when I display it, it shows up exactly as above, which is wrong. Is there any way in C# to decode or otherwise account for HTML entities?

I am also using HtmlAgilityPack, so anything in that will do too.

2 Answers

3

You can use WebUtility.HtmlDecode (documented on MSDN) to decode HTML entities:

WebUtility.HtmlDecode("&quot;This is an example,&quot; ABC.");

Just add this namespace import:

using System.Net;

The result is the string "This is an example," ABC. (written as a C# literal: "\"This is an example,\" ABC.").

You can also use HtmlEntity.DeEntitize in HTML Agility Pack:

HtmlEntity.DeEntitize(string text)
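
As a minimal sketch of the first option: WebUtility lives in the System.Net namespace of the base class library, so as far as I know nothing extra has to be downloaded. The TitleDecoder class and its Decode method are hypothetical names used only for illustration. If you would rather stay inside HTML Agility Pack, HtmlEntity.DeEntitize could be swapped in at the same spot.

```csharp
using System.Net;

// Hypothetical helper; the class and method names are illustrative only.
public static class TitleDecoder
{
    // Converts named and numeric HTML entities in a raw title to plain characters.
    public static string Decode(string rawTitle)
    {
        return WebUtility.HtmlDecode(rawTitle);
    }
}

// Usage:
//   TitleDecoder.Decode("&quot;This is an example,&quot; ABC.")
//     returns: "This is an example," ABC.
//   TitleDecoder.Decode("Tom &amp; Jerry&#39;s page")
//     returns: Tom & Jerry's page
```

Because the decoder handles the whole entity table, the same call works no matter which entities a given page title happens to contain.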

1 Comment

So I will have to download WebUtility? Is there anything in HTML Agility Pack that does this kind of thing?
0

You never know what you will find in a page title. Sometimes it is a complete mess. My suggestion is to read the string as it is and process it before you show or save it.

In this case, the solution is simple: replace

&quot;

with the corresponding character.

Whenever you read an HTML document to extract tags, watch out for tags that are never closed. If the author forgot to close the title tag, you will get the whole rest of the page as the title!
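
A rough sketch of both points, under stated assumptions: TitleExtractor, ExtractRawTitle, ReplaceQuotEntity and the 300-character cap are all hypothetical choices made for illustration, and the plain string search is not a real HTML parser. It returns null instead of swallowing the rest of the page when no closing title tag is found within the cap.

```csharp
using System;

// Hypothetical helper; names and the length cap are illustrative assumptions.
public static class TitleExtractor
{
    // Assumed cap: anything longer is treated as a broken or unclosed <title>.
    private const int MaxTitleLength = 300;

    // Returns the raw title text, or null when no closed <title>...</title>
    // pair is found within the cap.
    public static string ExtractRawTitle(string html)
    {
        int open = html.IndexOf("<title", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // Skip past the end of the opening tag (it may carry attributes).
        int contentStart = html.IndexOf('>', open);
        if (contentStart < 0) return null;
        contentStart++;

        int close = html.IndexOf("</title", contentStart, StringComparison.OrdinalIgnoreCase);
        if (close < 0 || close - contentStart > MaxTitleLength) return null;

        return html.Substring(contentStart, close - contentStart).Trim();
    }

    // The single replacement suggested above: &quot; becomes a double quote.
    public static string ReplaceQuotEntity(string title)
    {
        return title.Replace("&quot;", "\"");
    }
}
```

This only covers the one entity discussed here; every other entity would need its own replacement.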

2 Comments

I mean, I can do that for this particular string. But if I run 1000 queries I will get a lot of different characters. How would I replace them all? There should be an easy way to convert them, right?
The possibilities are limited. Read about special HTML characters at utexas.edu/learn/html/spchar.html and implement some replacement method(s).
