Read <body> tag of HTML file using C#

Question

I need to get all the content inside the body tag of an HTML file using C#. Are there any good and effective ways of doing this?

Is this a file on disk or a webpage you're pulling down?

– Roman, Commented Oct 27, 2010 at 20:32
Sorry, just starting to accept, my mistake

– Rasmus Christensen, Commented Oct 27, 2010 at 20:41
And yes, I'm the owner of the file that needs parsing.

– Rasmus Christensen, Commented Oct 27, 2010 at 20:41

carla · Accepted Answer · 2017-11-24 20:44:29Z

9

Check out the HTML Agility Pack, which can do all sorts of HTML manipulation.

It gives you an interface somewhat similar to the XmlDocument XML handling interface:

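 using HtmlAgilityPack;

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");

 // SelectSingleNode returns null when nothing matches the XPath.
 HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("/html/body");

 if (bodyNode != null)
 {
     // InnerHtml holds the markup between <body> and </body>.
     string bodyContent = bodyNode.InnerHtml;
 }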

edited Nov 24, 2017 at 20:44

carla


answered Oct 27, 2010 at 20:34

marc_s


Darin Dimitrov · Accepted Answer · 2010-10-27 20:34:32Z

3

You may take a look at SgmlReader and HTML Agility Pack.

answered Oct 27, 2010 at 20:34

Darin Dimitrov


4 Comments

Asbjørn Ulsberg Over a year ago

That URL to SgmlReader leads to a very old version that hasn't been touched in years. The guys maintaining SgmlReader these days are MindTouch. I would recommend SgmlReader over HtmlAgilityPack due to its lower-level approach and active maintenance. developer.mindtouch.com/en/docs/SgmlReader

nrkn Over a year ago

If your HTML isn't well-formed XHTML, I think you'll find that SgmlReader (and yes, use the MindTouch version mentioned in the comment above) is your best bet.

Alohci Over a year ago

@asbjomu - Looking through the conversion examples on the mindtouch site, I can't find a single one where SgmlReader produces a DOM that matches what browsers do. I don't know whether HTML Agility Pack is any better, but I wasn't impressed.

Asbjørn Ulsberg Over a year ago

@Alohci I agree that SgmlReader isn't up to par with browser parsers, but there aren't many alternatives native to C# that do it better. HtmlAgilityPack surely doesn't.

Dutchie432 · Accepted Answer · 2010-10-27 20:36:10Z

2

It's easy enough to pull the page code into a string, search for the occurrence of the string "<body" and the string "</body", and do a little math to get your value...

answered Oct 27, 2010 at 20:36

Dutchie432


Maghalakshmi Saravana · Accepted Answer · 2020-02-18 12:32:41Z

1

Read the HTML file into a string and extract the text content of the body tag using C# regular expressions, without the HTML Agility Pack:

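       private void Button_Click(object sender, RoutedEventArgs e)
        {
            string filepath = @"C:\Users\Testing\Documents\sample1.txt";
            string htmlString = File.ReadAllText(filepath);
            string htmlTagPattern = "<.*?>";
            // Singleline lets '.' match newlines, so the body may span several lines.
            Regex oRegex = new Regex(@"<body\b[^>]*>(.*?)</body>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Keep only the captured body content (empty string if there is no body element).
            htmlString = oRegex.Match(htmlString).Groups[1].Value;
            // Strip the remaining tags, whitespace-only lines, and &nbsp; entities.
            htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty, RegexOptions.Singleline);
            htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
            htmlString = htmlString.Replace("&nbsp;", string.Empty);
        }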

answered Feb 18, 2020 at 12:32

Maghalakshmi Saravana


1 Comment

Naveen Over a year ago

It's not getting the proper result.

Bryan · Accepted Answer · 2010-10-27 20:58:12Z

0

If it happens to be XHTML, then you could use XPath.
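As a rough sketch of that idea, the snippet below loads well-formed XHTML with the built-in `XmlDocument` and selects the body via XPath. The class and method names (`XhtmlBody`, `GetBodyInnerXml`) are made up for illustration, and it assumes the document declares the standard XHTML namespace and has no DOCTYPE or HTML-only entities such as `&nbsp;`, which would likely need extra reader settings.

```csharp
using System;
using System.Xml;

public static class XhtmlBody
{
    // Hypothetical helper: returns the inner markup of <body>,
    // or null if the document has no body element.
    public static string GetBodyInnerXml(string xhtml)
    {
        var doc = new XmlDocument();
        // Throws XmlException if the input is not well-formed XML.
        doc.LoadXml(xhtml);

        // XHTML elements live in a namespace, so the XPath needs a prefix bound to it.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNode body = doc.SelectSingleNode("/x:html/x:body", ns);
        return body?.InnerXml;
    }
}
```

Calling `XhtmlBody.GetBodyInnerXml(File.ReadAllText("file.htm"))` would then give you the body markup as a string. Note that the serialized child elements may carry their own `xmlns` attribute, since each one is written out with the namespace it belongs to.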

answered Oct 27, 2010 at 20:58

Bryan


Tomas Voracek · Accepted Answer · 2010-11-20 01:02:26Z

0

Use the XML methods and XPath. For more advanced HTML manipulation, use the HTML Agility Pack.

edited Nov 20, 2010 at 1:02

answered Oct 27, 2010 at 21:01

Tomas Voracek


Maxim Zabolotskikh · Accepted Answer · 2023-03-09 10:02:31Z

0

To save you the math in the accepted answer:

var start = html.IndexOf("<body>") + "<body>".Length;
var end = html.IndexOf("</body>");
var result = html.Substring(start, end - start);

Mind that it's not 100% bulletproof:

It will fail on CDATA blocks containing <body>
It will fail if you have something like <body lang="en">

So all in all you are probably better off with the Agility Pack, unless you know for sure which HTML you are working with.
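If you do stay with plain string searching, a slightly more tolerant variant might look like the sketch below. It handles an opening tag with attributes, ignores case, and returns null instead of throwing when a tag is missing. `BodyExtractor` and `GetBodyContent` are names invented for this example, and it still is not a parser: a `<body` inside a comment, script, or CDATA block, or a `>` inside an attribute value, will still trip it up.

```csharp
using System;

public static class BodyExtractor
{
    // Hypothetical helper: returns the text between the opening and closing
    // body tags, or null if either tag cannot be found.
    public static string GetBodyContent(string html)
    {
        // Search for "<body" without the closing '>' so attributes are allowed.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // The opening tag ends at the next '>' after "<body".
        int start = html.IndexOf('>', open);
        if (start < 0) return null;
        start++; // move past the '>'

        // Look for the closing tag only after the opening tag.
        int end = html.IndexOf("</body", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }
}
```

For example, `BodyExtractor.GetBodyContent("<html><body lang=\"en\"><p>Hi</p></body></html>")` returns `<p>Hi</p>`.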

edited Mar 9, 2023 at 10:02

answered Mar 9, 2023 at 9:28

Maxim Zabolotskikh
