Extract data from an HTML snippet using Regex in .NET

Question

Here is a sample

<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>

The document basically consists of these entries repeated over and over.

I would like to extract the following:

1. The partial URL after href="
2. The chapter text
3. The chapter name
4. The date

Currently I am using two separate queries to get the data.

Query 1:

(?<=^|>)[^><]+?(?=<|$)

This extracts 2, 3 and 4.

Query 2:

(?<=<a href=")[^"]+

This extracts 1.

I want a single query that can extract all four.

Regex is something I am not good at. It took me 2 hours of trial and error to get this.

Use the HTML Agility Pack for this; don't use a regular expression.
– Ibrahim Najjar, Commented Nov 21, 2013 at 9:05

Colin Mackay · Accepted Answer · 2013-11-21 09:57:16Z

1

Regex and HTML together are a pain. If you have the scope to use it, the HTML Agility Pack is what you want. I wrote a quick intro to its use a couple of years ago.

answered Nov 21, 2013 at 9:57

1 Comment

VRM Over a year ago

I am trying it now. ATM it isn't well documented but looks promising.

gpmurthy · Accepted Answer · 2013-11-23 03:05:27Z

0

Consider the following Regex...

((?<=href\=\").*?(?=\")|(?<=href\=\".*?\"\>).*?(?=\<)|(?<=\</.*?\>)[\s\S]*?(?=\<)|(?<=\<td\>).*?(?=</td\>))

Good Luck!

edited Nov 23, 2013 at 3:05

answered Nov 21, 2013 at 23:13

2 Comments

VRM Over a year ago

This code does not retrieve the chapter title. It retrieves the rest. But I must thank you for the effort you put into this.

VRM Over a year ago

And to be honest, it's way better than the match pattern I used.
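For the chapter title reported as missing above, one option is a single pattern with one named group per field instead of a chain of lookarounds. The following is only a sketch, written in Python for illustration; the group names (`url`, `chapter`, `name`, `date`) and the helper `extract_rows` are made up for this example. In .NET the named-group syntax is `(?<name>...)` rather than Python's `(?P<name>...)`. It assumes every entry follows exactly the layout of the sample in the question, and it will break if the markup varies.

```python
import re

# Sample row copied from the question.
HTML = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>"""

# One pattern, four named groups (names are illustrative, not from the question).
ROW = re.compile(
    r'<a\s+href="(?P<url>[^"]+)"\s*>'      # 1. partial URL inside href="..."
    r'(?P<chapter>[^<]+)</a>'              # 2. chapter text (the link text)
    r'\s*:\s*(?P<name>[^<]+?)\s*</td>'     # 3. chapter name after the colon
    r'\s*<td>(?P<date>[^<]+)</td>'         # 4. date in the next cell
)


def extract_rows(html):
    """Return one dict of the four fields per matched entry."""
    return [m.groupdict() for m in ROW.finditer(html)]


rows = extract_rows(HTML)
for row in rows:
    print(row["url"], row["chapter"], row["name"], row["date"], sep=" | ")
```

Because each field has its own group, one match per entry carries all four values together, so the results do not have to be re-paired afterwards.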

ttrmw · Accepted Answer · 2013-11-21 10:56:37Z

0

If the HTML in question is valid XHTML you can parse it as XML, for which there is extensive support under System.Xml.

You could then query with XPath:

...SelectSingleNode("//tr/td/a/@href").Value

and so on!

Most HTML on the internet is not valid XHTML, however, in which case HAP (the HTML Agility Pack) is very pleasant to use (and still allows querying by XPath, should you so choose).
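As a rough illustration of the parse-as-XML route, here is a sketch in Python using the standard library's `xml.etree.ElementTree` as a stand-in for System.Xml (it supports only a subset of XPath). The wrapping `<table>` element and the helper name `extract_rows` are assumptions added for this example; the row itself is the sample from the question, which happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# The question's sample row, wrapped in a <table> root (the wrapper is an assumption).
XHTML = """<table>
<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>
</table>"""


def extract_rows(xhtml):
    """Parse well-formed XHTML and return the four fields for each <tr>."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.iter("tr"):
        link = tr.find("./td/a")      # the <a> inside the first cell
        cells = tr.findall("./td")    # both cells of the row
        if link is None or len(cells) < 2:
            continue                  # skip rows that do not match the layout
        rows.append({
            "url": link.get("href"),
            "chapter": link.text,
            # Text following </a> is stored as the element's "tail".
            "name": (link.tail or "").strip().lstrip(":").strip(),
            "date": cells[1].text,
        })
    return rows


rows = extract_rows(XHTML)
```

This only works while the input stays well-formed; a single unclosed tag makes `ET.fromstring` raise a `ParseError`.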

answered Nov 21, 2013 at 10:56

2 Comments

VRM Over a year ago

I can prune the section of the document and format it as valid HTML. That's the last resort if everything else fails.

ttrmw Over a year ago

You might consider checking out HAP then. It offers classes just like those in System.Xml for handling HTML parsing. It will handle dodgy HTML gracefully, so there shouldn't be any great need to prune.
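To illustrate what a tolerant HTML parser buys you, here is a sketch in Python using the standard library's `html.parser` as a stand-in. It is not HAP and does not reflect HAP's API. The `MESSY` input, the second chapter row, and the names `RowParser` and `parse_rows` are all hypothetical. The markup deliberately omits the closing `</td>` and `</tr>` tags, which would make an XML parser fail.

```python
from html.parser import HTMLParser

# Deliberately sloppy markup with no closing </td> or </tr> (hypothetical input).
MESSY = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  <td>09/24/2013
<tr>
  <td><a href="/testing/2">Sample Textbook Chapter 2</a> : Variables
  <td>09/25/2013
"""


class RowParser(HTMLParser):
    """Collect url, chapter, name and date for each <tr>, driven by start tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # dict for the row currently being filled
        self._cell = 0         # index of the current <td> within the row
        self._in_link = False  # True while inside <a>...</a>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"url": None, "chapter": "", "name": "", "date": ""}
            self._cell = 0
            self.rows.append(self._row)
        elif self._row is None:
            return
        elif tag == "td":
            self._cell += 1
        elif tag == "a":
            self._row["url"] = dict(attrs).get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._row is None:
            return
        if self._in_link:
            self._row["chapter"] += data
        elif self._cell == 1:
            self._row["name"] += data
        elif self._cell == 2:
            self._row["date"] += data


def parse_rows(html):
    """Feed the markup to the parser and tidy up whitespace and the colon."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "url": row["url"],
            "chapter": row["chapter"].strip(),
            "name": row["name"].strip().lstrip(":").strip(),
            "date": row["date"].strip(),
        }
        for row in parser.rows
    ]


rows = parse_rows(MESSY)
```

The parser reacts to start tags and text as they arrive, so missing closing tags do not stop it, and no pruning of the document is needed first.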
