C# Regex: Getting URL and text from multiple "a href"-tags

Question

I want to be able to scrape a webpage containing multiple "<a href"-tags and return a structured collection of them.

<div>
    <p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
        <a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
    </p>
    <a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
    <iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
        frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>

So I want these values:

https://stackoverflow | Classic link
http://sloppy-html-5-href.com | I lovez HTML 5
/my-tribute-to-javascript.html | I also love JS

As you can see, only values in an "a href" should be caught, with both the link and the content within the tags. It should support all HTML 5-valid href values. The href attributes can be surrounded by any other attributes.

So I basically want a regex to fill in the following code:

public IEnumerable<Tuple<string, string>> GetLinks(string html) {
    string pattern = string.Empty; // TODO: Get solution from Stackoverflow
    var matches = Regex.Matches(html, pattern);

    foreach (Match match in matches) {
        // Groups[0] is the whole match; the captured groups start at index 1
        yield return new Tuple<string, string>(
            match.Groups[1].Value, match.Groups[2].Value);
    }
}

"TODO: Get solution from Stackoverflow" - Really? How about "TODO: Try to figure out a solution and if I get stuck check on StackOverflow"?
– nnnnnn, Commented Nov 8, 2011 at 10:58
@nnnnnn Got it, no joking allowed... very constructive comment.
– Seb Nilsson, Commented Nov 8, 2011 at 11:49
My apologies, of course joking is allowed. In my sleep-deprived state I did not realise it was a joke or I would not have posted that comment. (I do sometimes post "What have you tried so far?" type comments, but to be fair your question provides plenty of detail of your requirement and some code so it does not fit the profile of the usual "do my work for me" questions.)
– nnnnnn, Commented Nov 8, 2011 at 12:30

pierroz · Accepted Answer · 2011-11-08 10:56:19Z

4

I've always read that parsing HTML with regular expressions is evil. OK... that's surely true...
But like evil, regexes are so much fun :)
So I'd give this one a try:

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
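
The pattern above only matches href values wrapped in double or single quotes, so the unquoted HTML 5 link from the question is skipped. The following is a minimal sketch of one possible variant, not a full HTML parser: it reuses the group name for the quoted and unquoted cases (which .NET allows) and enables Singleline so link text may span line breaks. The names `LinkExtractor` and `AnchorPattern` are hypothetical, and the pattern can still be fooled by text such as ` href=` inside another attribute's value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LinkExtractor
{
    // Assumed variant of the pattern above. The three alternatives capture a
    // double-quoted, single-quoted, or unquoted href into the same named group.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s>""']+))[^>]*>(?<value>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Item1 is the URL, Item2 is the raw content between <a ...> and </a>.
            yield return Tuple.Create(
                match.Groups["href"].Value, match.Groups["value"].Value);
        }
    }
}
```

Run against the sample markup from the question, `LinkExtractor.GetLinks(html)` should yield the three href/text pairs and ignore the iframe. Using `[^>]*` instead of `.*?` inside the tag keeps the match from running past the end of the opening tag.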

WKordos · Accepted Answer · 2011-11-08 10:41:15Z

3

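Isn't it easier to use Html Agility Pack and XPath than a regex?

It would be something like:

using HtmlAgilityPack;

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression and returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");

if (aNodeCollection != null)
{
    foreach (HtmlNode node in aNodeCollection)
    {
        string href = node.Attributes["href"].Value; // the link target
        string text = node.InnerText;                // the text inside the tag
    }
}

It's only a sketch, not tested code.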

edited Nov 8, 2011 at 10:41

answered Nov 8, 2011 at 10:35

2 Comments

Seb Nilsson Over a year ago

Interesting approach, but it specifically says HTML 5, which is not necessarily valid XML.

WKordos Over a year ago

I still haven't had time to dive into HTML5, so I didn't know that it allows malformed documents (looks like a step back), but I would still give it a try. Agility Pack worked well for me even with nasty HTML; it sanitizes it quite well.
