
Here is a sample of the HTML I am working with:

<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>

The document consists of entries like this one, repeated many times.

I would like to extract the following:

  1. The partial URL in the href attribute (here, /testing/1).
  2. The chapter text (the link text).
  3. The chapter name (the text after the colon).
  4. The date.

Currently I am using two separate queries to get the data:

Query 1:

(?<=^|>)[^><]+?(?=<|$)

This extracts items 2, 3 and 4.

Query 2:

(?<=<a href=")[^"]+

This extracts item 1.

I want a single query that can extract all four.

I am not good at regex; it took me two hours of trial and error to get this far.

Use the HTML Agility Pack for this, not a regular expression. Commented Nov 21, 2013 at 9:05

3 Answers


Regex and HTML are a painful combination. If you are free to use it, the HTML Agility Pack (HAP) is what you want. I wrote a quick intro to its use a couple of years ago.
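
The Agility Pack itself is a .NET library, so its API is not shown here. As a rough illustration of the same parse-instead-of-regex approach, here is a minimal sketch in Python using the standard-library `html.parser`. The names `ChapterRows` and `extract_rows` are made up for this example, and it assumes every row has the two-cell layout from the question.

```python
from html.parser import HTMLParser

# Sample row copied from the question.
SAMPLE = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>"""


class ChapterRows(HTMLParser):
    """Collects (href, chapter text, chapter name, date) for each <tr>.

    Hypothetical helper: assumes cell 1 holds the link plus the name
    and cell 2 holds the date, as in the question's sample.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # fields of the <tr> currently being read
        self._cell = 0         # 1-based index of the current <td>
        self._in_link = False  # True while inside <a>...</a>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"href": None, "chapter": "", "name": "", "date": ""}
            self._cell = 0
        elif self._row is None:
            return
        elif tag == "td":
            self._cell += 1
        elif tag == "a":
            self._row["href"] = dict(attrs).get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "tr" and self._row is not None:
            r = self._row
            self.rows.append((
                r["href"],
                r["chapter"].strip(),
                r["name"].strip(" :\n\t"),  # drop the leading " : " separator
                r["date"].strip(),
            ))
            self._row = None

    def handle_data(self, data):
        if self._row is None:
            return
        if self._in_link:
            self._row["chapter"] += data
        elif self._cell == 1:
            self._row["name"] += data
        elif self._cell == 2:
            self._row["date"] += data


def extract_rows(html):
    """Parse the markup and return one tuple per table row."""
    parser = ChapterRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `extract_rows(SAMPLE)` yields one tuple per `<tr>`, with all four fields taken from the document structure and no lookbehind patterns involved.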


1 Comment

I am trying it now. At the moment it isn't well documented, but it looks promising.

Consider the following regex:

((?<=href\=\").*?(?=\")|(?<=href\=\".*?\"\>).*?(?=\<)|(?<=\</.*?\>)[\s\S]*?(?=\<)|(?<=\<td\>).*?(?=</td\>))

Good Luck!
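
If you do stay with regex, one alternative to chaining lookbehinds is a single pattern with one named group per field, so each match is a whole row. The sketch below is in Python, which writes named groups as `(?P<name>...)`; .NET writes them as `(?<name>...)`. The group names and the helper `extract_all` are invented for this example, and the pattern assumes the exact row layout from the question (link, colon, name, then a date cell).

```python
import re

# Sample row copied from the question.
SAMPLE = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>"""

# One match per row; each named group captures one of the four fields.
ROW = re.compile(
    r'<a\s+href="(?P<href>[^"]+)"\s*>'      # 1. partial URL
    r'(?P<chapter>[^<]+)</a>'               # 2. chapter text (link text)
    r'\s*:\s*(?P<name>[^<]+?)\s*</td>'      # 3. chapter name after the colon
    r'\s*<td>\s*(?P<date>[^<]+?)\s*</td>'   # 4. date in the next cell
)


def extract_all(html):
    """Return one dict of the four fields per matching row."""
    return [m.groupdict() for m in ROW.finditer(html)]
```

Like any regex over HTML, this breaks as soon as the markup deviates from the sample (extra attributes on the link, a missing colon, nested tags in a cell), which is the main argument for a parser.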

2 Comments

This code does not retrieve the chapter title, but it retrieves the rest. I must thank you for the effort you put into this.
And to be honest, it's way better than the match pattern I used.

If the HTML in question is valid XHTML, you can parse it as XML, for which there is extensive support under System.Xml.

You could then query with XPath:

...SelectSingleNode("//tr/td/a/@href").Value

and so on!

Most HTML on the internet is not valid XHTML, however, in which case HAP is very pleasant to use (and it still allows querying by XPath, should you so choose).
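
To make the XML route concrete, here is a minimal sketch in Python with the standard-library `xml.etree.ElementTree`, which works on the question's sample because that fragment happens to be well-formed. ElementTree supports only a limited XPath subset (no `/@href` attribute selection), so the attribute is read with `.get()` instead. The helper name `extract_with_xpath` and the wrapping `<table>` element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Sample row copied from the question.
SAMPLE = """<tr>
  <td>
    <div class="VBChap"></div>
    <a href="/testing/1">Sample Textbook Chapter 1</a> : Introduction to VB.net
  </td>
  <td>09/24/2013</td>
</tr>"""


def extract_with_xpath(xhtml_rows):
    """Return (href, chapter text, chapter name, date) for each <tr>.

    Assumes the input is well-formed XML; wraps it in a <table> so that
    several rows still form a single document with one root element.
    """
    root = ET.fromstring("<table>" + xhtml_rows + "</table>")
    rows = []
    for tr in root.findall(".//tr"):
        link = tr.find("./td/a")
        cells = tr.findall("./td")
        if link is None or len(cells) < 2:
            continue  # skip rows that do not follow the expected layout
        # Text after </a> inside the same cell, e.g. " : Introduction to VB.net"
        name = (link.tail or "").strip().lstrip(":").strip()
        rows.append((
            link.get("href"),
            (link.text or "").strip(),
            name,
            (cells[1].text or "").strip(),
        ))
    return rows
```

On markup that is not well-formed, `ET.fromstring` raises `ParseError`, which is the case where a forgiving HTML parser is the better fit.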

2 Comments

I can prune the section of the document and format it as valid HTML. That's the last resort if everything else fails.
You might consider checking out HAP then. It offers classes just like those in System.Xml for handling HTML parsing. It handles dodgy HTML gracefully, so there shouldn't be any great need to prune.
