3

I need to get all the content inside the body tag of an HTML file using C#. Are there any good and effective ways of doing this?

3
  • Is this a file on disk or a webpage you're pulling down? Commented Oct 27, 2010 at 20:32
  • Sorry, just starting to accept, my mistake. Commented Oct 27, 2010 at 20:41
  • And yes, I'm the owner of the file that needs parsing. Commented Oct 27, 2010 at 20:41

7 Answers

9

Check out the HTML Agility Pack, which handles all sorts of HTML manipulation.

It gives you an interface somewhat similar to the XmlDocument XML handling interface:

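 // Requires: using HtmlAgilityPack;
 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");

 // SelectSingleNode returns null when the document has no /html/body element
 HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("/html/body");

 if (bodyNode != null)
 {
    // InnerHtml holds everything between <body> and </body>
    string bodyContent = bodyNode.InnerHtml;
 }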

3

You may take a look at SgmlReader and HTML Agility Pack.

4 Comments

That URL to SgmlReader leads to a very old version that hasn't been touched in years. The guys maintaining SgmlReader these days are MindTouch. I would recommend SgmlReader over HtmlAgilityPack due to its lower level approach and active maintenance. developer.mindtouch.com/en/docs/SgmlReader
If your HTML isn't well-formed XHTML, I think you'll find that SgmlReader (and yes, use the MindTouch version mentioned in the comment above) is your best bet.
@asbjomu - Looking through the conversion examples on the mindtouch site, I can't find a single one where SgmlReader produces a DOM that matches what browsers do. I don't know whether HTML Agility Pack is any better, but I wasn't impressed.
@Alohci I agree that SgmlReader isn't up to par with browser parsers, but there aren't many alternatives native to C# that do it better. HtmlAgilityPack surely doesn't.
2

It's easy enough to pull the page code into a string, search for the first occurrence of the string "<body" and of the string "</body", and do a little index math to extract the content between them.

1

Read the HTML file into a string and extract the body tag content in C# without the HTML Agility Pack:

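       private void Button_Click(object sender, RoutedEventArgs e)
        {
            // Requires: using System.IO; using System.Text.RegularExpressions;
            string filepath = @"C:\Users\Testing\Documents\sample1.txt";
            string htmlString = File.ReadAllText(filepath);
            string htmlTagPattern = "<.*?>";
            // Singleline lets '.' span line breaks; group 1 captures the body content
            Regex oRegex = new Regex(@"<body\b[^>]*>(.*?)</body>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
            Match match = oRegex.Match(htmlString);
            string bodyHtml = match.Success ? match.Groups[1].Value : string.Empty;
            // Optional: strip tags, blank lines and &nbsp; to keep only the body text
            string bodyText = Regex.Replace(bodyHtml, htmlTagPattern, string.Empty, RegexOptions.Singleline);
            bodyText = Regex.Replace(bodyText, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
            bodyText = bodyText.Replace("&nbsp;", string.Empty);
        }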

1 Comment

It's not getting the proper result.
0

If it happens to be XHTML, then you could use XPath.
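
A minimal sketch of that idea, assuming the file is well-formed XHTML that declares the standard XHTML namespace. The class and method names (XhtmlBody, FindBody) are made up for illustration. Documents with named entities such as &nbsp; or without the namespace declaration would need extra handling.

```csharp
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XhtmlBody
{
    // Hypothetical helper: returns the <body> element of an XHTML string,
    // or null when the XPath query finds nothing.
    public static XElement FindBody(string xhtml)
    {
        // Throws XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(xhtml);

        // XHTML elements live in a namespace, so the XPath needs a prefix for it.
        var ns = new XmlNamespaceManager(new NameTable());
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        return doc.XPathSelectElement("/x:html/x:body", ns);
    }
}
```

From the returned element, body.Nodes() enumerates the child nodes and body.Value gives the concatenated text.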

0

Use the XML methods with XPath. For more advanced HTML manipulation, use the HTML Agility Pack.

0

To save you the math in the accepted answer:

var start = html.IndexOf("<body>") + "<body>".Length;
var end = html.IndexOf("</body>");
var result = html.Substring(start, end - start);

Mind that it's not 100% bulletproof:

  • It will fail on CDATA blocks containing <body>
  • It will fail if you have something like <body lang="en">

So all in all you are probably better off with the Agility Pack, unless you know for sure which HTML you are working with.
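
If you still want to stay with plain string searching, here is a slightly more tolerant sketch. It copes with attributes on the opening tag and with upper-case tags, and it reports failure instead of throwing when a tag is missing. The names BodyExtractor and TryExtractBody are made up for illustration, and it is still not a parser: an attribute value containing ">" or a commented-out "<body" would fool it.

```csharp
using System;

public static class BodyExtractor
{
    // Hypothetical helper: copies the markup between <body ...> and </body>
    // into 'body'. Returns false if either tag cannot be found.
    public static bool TryExtractBody(string html, out string body)
    {
        body = string.Empty;

        // Case-insensitive search so <BODY> is found as well.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return false;

        // Skip any attributes by moving to the end of the opening tag.
        int start = html.IndexOf('>', open);
        if (start < 0) return false;
        start++;

        // Take the last closing tag, in case "</body" also appears inside a script.
        int end = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (end < start) return false;

        body = html.Substring(start, end - start);
        return true;
    }
}
```

Call it as BodyExtractor.TryExtractBody(html, out string body) and check the return value before using body.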
