I want to be able to scrape a webpage containing multiple "<a href"-tags and return a structured collection of them.
<div>
<p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
<a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
</p>
<a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
<iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>
So I want these values:
- https://stackoverflow | Classic link
- http://sloppy-html-5-href.com | I lovez HTML 5
- /my-tribute-to-javascript.html | I also love JS
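As you can see, only <a> tags with an href attribute should be matched, capturing both the link target and the text between the tags. Every href syntax that is valid in HTML5 should be supported (double-quoted, single-quoted, and unquoted values), and the href attribute may be preceded or followed by any other attributes.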
So I basically want a regex to fill in the following code:
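// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public IEnumerable<Tuple<string, string>> GetLinks(string html) {
    string pattern = string.Empty; // TODO: Get solution from Stack Overflow
    var matches = Regex.Matches(html, pattern);
    foreach (Match match in matches) {
        // Groups[0] is the entire match, so the href and the link text
        // are expected in capture groups 1 and 2 of the pattern.
        yield return new Tuple<string, string>(
            match.Groups[1].Value, match.Groups[2].Value);
    }
}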
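To make the expected behaviour concrete, here is a minimal sketch of the kind of pattern I have in mind. The pattern, the class name LinkSketch, and the group names url and text are my own assumptions, not a verified solution. It assumes no attribute value contains a ">" character and that <a> elements are not nested, and it only approximates the HTML5 rules for unquoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Hypothetical pattern (an assumption, not a complete HTML5 parser):
    //   <a\b[^>]*?\shref   an <a> tag with href preceded by whitespace,
    //                      so attributes such as data-href are skipped
    //   "..." | '...' | unquoted   the three href value syntaxes
    //   [^>]*>             any remaining attributes up to the end of the tag
    //   (?<text>.*?)</a>   the link text up to the first closing tag
    private const string Pattern =
        @"<a\b[^>]*?\shref\s*=\s*" +
        @"(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))" +
        @"[^>]*>(?<text>.*?)</a>";

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        // Singleline lets '.' match newlines inside the link text.
        var matches = Regex.Matches(
            html, Pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match match in matches)
        {
            yield return Tuple.Create(
                match.Groups["url"].Value, match.Groups["text"].Value);
        }
    }
}
```

Calling LinkSketch.GetLinks on the sample markup above should yield the three href/text pairs I listed and skip the iframe and the comment. I am aware that a real HTML parser would be more robust than any regex for this, so treat the sketch as an illustration of the desired output rather than a recommendation.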