
I want to be able to scrape a webpage containing multiple "<a href"-tags and return a structured collection of them.

<div>
    <p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
        <a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
    </p>
    <a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
    <iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
        frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>

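So I want these values (href, then link content):

https://stackoverflow            Classic link
http://sloppy-html-5-href.com    I lovez HTML 5
/my-tribute-to-javascript.html   I also love JS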

As you can see, only values in an "a href" should be caught, with both the link and the content within the tags. It should support every href form that is valid in HTML 5 (double-quoted, single-quoted, or unquoted). The href attribute can be surrounded by any other attributes.

So I basically want a regex to fill in the following code:

public IEnumerable<Tuple<string, string>> GetLinks(string html) {
     string pattern = string.Empty; // TODO: Get solution from Stackoverflow
     var matches = Regex.Matches(html, pattern);

     foreach(Match match in matches) {
         // Groups[0] is the entire match, so the captures start at index 1
         yield return new Tuple<string, string>(
             match.Groups[1].Value, match.Groups[2].Value);
     }
}
  • "TODO: Get solution from Stackoverflow" - Really? How about "TODO: Try to figure out a solution and if I get stuck check on StackOverflow"? Commented Nov 8, 2011 at 10:58
  • @nnnnnn Got it, no joking allowed... very constructive comment. Commented Nov 8, 2011 at 11:49
  • My apologies, of course joking is allowed. In my sleep-deprived state I did not realise it was a joke or I would not have posted that comment. (I do sometimes post "What have you tried so far?" type comments, but to be fair your question provides plenty of detail of your requirement and some code so it does not fit the profile of the usual "do my work for me" questions.) Commented Nov 8, 2011 at 12:30

2 Answers

4

I've always read that parsing HTML with regular expressions is evil. OK, that's surely true...
But like all evil things, regexes are so much fun :)
So I'd give this one a try:

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
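The pattern above only matches href values wrapped in quotes, so it would skip the unquoted `href=http://sloppy-html-5-href.com` link from the question. A possible variant is sketched below, assuming that attribute values never contain a `>` character and that anchors are not nested. The class name `LinkExtractor` and the field `AnchorPattern` are made up for illustration, and the pattern is a rough attempt rather than a complete HTML 5 parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Hypothetical pattern: "<a", whitespace, optional other attributes, then
    // href= followed by a double-quoted, single-quoted, or unquoted value.
    // .NET lets several alternatives share the group name "href".
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s+(?:[^>]*?\s)?href\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s>""']+))[^>]*>(?<value>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Item1 is the href value, Item2 is the raw markup between <a ...> and </a>
            yield return Tuple.Create(
                match.Groups["href"].Value, match.Groups["value"].Value);
        }
    }
}
```

Calling `LinkExtractor.GetLinks(html)` on the sample from the question should yield the three anchors and ignore the iframe, since the pattern requires `<a` followed by whitespace. Any markup nested inside a link is returned as-is in the second tuple item.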

3

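Isn't it easier to use the HTML Agility Pack and XPath than a regex?

It would look something like this:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression and returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");

if (aNodeCollection != null)
{
    foreach (HtmlNode node in aNodeCollection)
    {
        string href = node.Attributes["href"].Value; // link target
        string text = node.InnerText;                // content between the tags
    }
}

It's a rough, untested sketch.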

2 Comments

Interesting approach, but it specifically says HTML 5, which is not necessarily valid XML.
I still haven't had time to dive into HTML5, so I didn't know that it allows malformed documents (which looks like a step back), but I would still give it a try. The Agility Pack has worked well for me even with nasty HTML; it sanitizes it quite well.
