Possible Duplicate:
Regular expression for parsing links from a webpage?

How can I find all URLs in an HTML document using a regular expression? I only need URLs that point to pages, so I want to exclude URLs that end with ".css", ".jpg", ".js", etc.

Example of HTML:

<a href=index.php?option=content&amp;task=view&amp;id=2&amp;Itemid=25 class="menu_selected" id="">Home</a>

or

<a href="http://data.stackexchange.com">data</a> |
                <a href="http://shop.stackexchange.com/">shop</a> |
                <a href="http://stackexchange.com/legal">legal</a> |

Thanks

  • string strRef = @"(href|HREF)[ ]*=[ ]*[""'][^""'#>]+[""']"; MatchCollection matches = new Regex(strRef).Matches(strResponse); Commented Jun 21, 2012 at 14:48

1 Answer

If you can, avoid regular expressions and use a proper HTML parser instead. For example, reference the HTML Agility Pack and use the following:

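using System.Linq;      // Enumerable.Empty
using HtmlAgilityPack;  // HtmlDocument, HtmlNode

var doc = new HtmlDocument();
doc.LoadHtml(yourHtmlInput);
// SelectNodes returns null when nothing matches, hence the fallback to an empty sequence.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]")
                              ?? Enumerable.Empty<HtmlNode>())
{
    string href = link.Attributes["href"].Value;
    if (!String.IsNullOrEmpty(href))
    {
        // Act on the link here, including ignoring it if it's a .jpg etc.
    }
}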
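
The comment inside the loop leaves the filtering step open. A minimal sketch of one way to do it, using only the base class library, could look like this. The names `LinkFilter` and `IsPageLink` and the list of extensions are hypothetical, chosen for illustration; they are not part of HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilter
{
    // Hypothetical list of extensions to skip; adjust it to your needs.
    private static readonly HashSet<string> SkippedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".ico"
        };

    // Returns true unless href is empty or ends in one of the skipped extensions.
    public static bool IsPageLink(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return false;

        // Drop the query string and fragment so "site.css?v=2" is still recognised.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        // Look for an extension only in the text after the last '/'.
        string lastSegment = path.Substring(path.LastIndexOf('/') + 1);
        int dot = lastSegment.LastIndexOf('.');
        if (dot < 0)
            return true; // no extension at all, e.g. "/legal"

        return !SkippedExtensions.Contains(lastSegment.Substring(dot));
    }
}
```

Inside the loop, `if (LinkFilter.IsPageLink(href)) { ... }` would then skip stylesheets, scripts and images. The check is purely textual, so a bare host name such as `http://example.js` would be misclassified; if that matters, resolving the link with `System.Uri` first and testing only its path is probably the safer route.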

7 Comments

I think a regular expression will be faster than HTML Agility Pack; please correct me if I am wrong.
It will probably be faster; HTML Agility Pack is likely to be more robust. I really only posted this because I had the code to hand from a project I did recently :)
RegEx won't work at all: codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html . Simple rule of thumb: use Regular Expressions to parse "Regular Languages" - HTML is not a "Regular Language" (refer to the Chomsky Hierarchy for more information).
Regex might be faster, but parsing HTML with regex is problematic. How many pages do you need to parse?
I need to parse a lot of pages (up to 5000). The application is multithreaded, so I want each thread to finish its work quickly.