I need to get all the content inside the body tag of an HTML file using C#. Are there any good and effective ways of doing this?
Is this a file on disk or a webpage you're pulling down? – Roman, Oct 27, 2010 at 20:32
Sorry, just starting to accept, my mistake. – Rasmus Christensen, Oct 27, 2010 at 20:41
And yes, I'm the owner of the file that needs parsing. – Rasmus Christensen, Oct 27, 2010 at 20:41
7 Answers
Check out the HTML Agility Pack, which handles all sorts of HTML manipulation.
It gives you an interface somewhat similar to the XmlDocument XML handling interface:
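// Requires the HtmlAgilityPack package: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("/html/body");
if (bodyNode != null)
{
    // bodyNode.InnerHtml holds everything between <body> and </body>
    string bodyContent = bodyNode.InnerHtml;
}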
You may take a look at SgmlReader and HTML Agility Pack.
4 Comments
Asbjørn Ulsberg
That URL to SgmlReader leads to a very old version that hasn't been touched in years. The people maintaining SgmlReader these days are MindTouch. I would recommend SgmlReader over HtmlAgilityPack due to its lower-level approach and active maintenance. developer.mindtouch.com/en/docs/SgmlReader
nrkn
If your HTML isn't well-formed XHTML, I think you'll find that SgmlReader (and yes, use the MindTouch version mentioned in the comment above) is your best bet.
Alohci
@asbjomu - Looking through the conversion examples on the mindtouch site, I can't find a single one where SgmlReader produces a DOM that matches what browsers do. I don't know whether HTML Agility Pack is any better, but I wasn't impressed.
Asbjørn Ulsberg
@Alohci I agree that SgmlReader isn't up to par with browser parsers, but there aren't many alternatives native to C# that do it better. HtmlAgilityPack surely doesn't.
Reading the HTML file into a string and getting the content of the body tag using C#, without the HTML Agility Pack:
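private void Button_Click(object sender, RoutedEventArgs e)
{
    // Requires: using System.IO; using System.Text.RegularExpressions;
    string filepath = @"C:\Users\Testing\Documents\sample1.txt";
    string htmlString = File.ReadAllText(filepath);
    string htmlTagPattern = "<.*?>";
    // Singleline lets '.' match newlines, so the body may span several lines.
    Regex oRegex = new Regex("<body.*?>(.*?)</body>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Keep the captured body content rather than replacing the whole match with nothing.
    Match bodyMatch = oRegex.Match(htmlString);
    htmlString = bodyMatch.Success ? bodyMatch.Groups[1].Value : string.Empty;
    // Strip the remaining tags, then whitespace-only lines, then &nbsp; entities.
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty, RegexOptions.Singleline);
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    htmlString = htmlString.Replace("&nbsp;", string.Empty);
}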
1 Comment
Naveen
It's not giving the proper result.
To save you the math in the accepted answer:
var start = html.IndexOf("<body>") + "<body>".Length;
var end = html.IndexOf("</body>");
var result = html.Substring(start, end - start);
Mind that it's not 100% bulletproof:
- It will fail on CDATA blocks containing <body>
- It will fail if you have something like <body lang="en">
So all in all, you are probably better off with the Agility Pack, unless you know for sure which HTML you are working with.
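If you still want to stay with plain string operations, the attribute case can be handled by searching for "<body" and then for the ">" that closes the opening tag. The following is only a rough sketch under that assumption; the names BodyExtractor and ExtractBody are made up for illustration. It still fails on CDATA blocks or comments containing a body tag, and on a ">" inside an attribute value, so a real parser remains the safer choice.

```csharp
using System;

public static class BodyExtractor
{
    // Returns the markup between the opening <body ...> tag and the last
    // </body> tag, or null when either tag cannot be found.
    public static string ExtractBody(string html)
    {
        // Case-insensitive search, so <BODY> and <body lang="en"> both match.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // The content starts right after the '>' that ends the opening tag.
        int start = html.IndexOf('>', open);
        if (start < 0) return null;
        start++;

        // Use the last closing tag in the document.
        int end = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (end < start) return null;

        return html.Substring(start, end - start);
    }
}
```

Usage would look like `var result = BodyExtractor.ExtractBody(File.ReadAllText("file.htm"));`, followed by a null check before using the result.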