
I want to extract URLs from the href attributes of a web page. For that I am using the regex pattern "(?(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)"

To extract the href from the HTML I used this pattern: @"href=\""(?[^\""#]?(?=[\""#]))(?(?#{2}[^#]?#{2})*)(?#[^""]+)?"""

The problem is that it does not extract the complete URL from the href. For a URL like "www.seo-sem.com", the result I get is only "www.seo": the match is truncated at the hyphen. Could you suggest a better regex pattern to extract URLs from href? Thanks in advance.

  • Don't use regex to parse HTML. Find a simple library like HTMLAgilityPack and use that. Commented May 10, 2010 at 17:55
  • No one posted the link yet? :) Commented May 10, 2010 at 17:56
  • Even for basic URI matching the regular expression needed is Ugly (yes, capital U). Commented May 10, 2010 at 17:57
  • @rebus, well, it's not so much HTML parsing, actually. It doesn't try to do anything with the actual structure of the document. For simply grabbing anything that looks like href='url' regex may just be appropriate enough. Commented May 10, 2010 at 17:58
  • (http://|https://)?([\w.-]+)?([\w-]+\.[\w-]+) with \2 and \3 backrefs referencing subdomains and domain respectively would help probably, but by no means would it catch all possible domain names out there. Commented May 10, 2010 at 18:25

1 Answer


Use the HTML Agility Pack to parse your HTML. You can query it using XPath, as it parses the HTML into an XmlDocument-like object.

See this for reasons not to parse HTML with regular expressions.
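To illustrate the parser-based approach, here is a minimal sketch. It does not use the HTML Agility Pack itself. It uses Python's standard-library `html.parser` as a stand-in, so the class name `HrefCollector`, the helper `extract_hrefs`, and the sample markup are all assumptions made for illustration. The idea is the same: let a parser find the `<a>` tags and read the `href` attribute, rather than matching the raw text with a regex.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip <a> tags without an href, or with an empty one.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all <a> tags in the given HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Made-up sample markup: mixed case, both quote styles, and a hyphenated host.
sample = (
    '<p><a href="http://www.seo-sem.com/page">one</a> '
    "<A HREF='www.example.com/~user'>two</A></p>"
)
print(extract_hrefs(sample))
```

Because the parser handles quoting and attribute boundaries, the hyphen in "www.seo-sem.com" needs no special treatment here.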


1 Comment

I resolved the hyphen issue by editing the regex. Thanks anyway, you all rock, keep it up.
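The edited regex is not shown in this thread, so the following is only a guess at the kind of change involved, written in Python rather than the asker's C#. The character class in the question allows letters, digits, `/`, `.` and `~` but not `-`, so a match stops at the first hyphen. Adding `-` at the end of the class, where it is read literally, lets the match continue. The names `WITHOUT_HYPHEN` and `WITH_HYPHEN` are hypothetical, and the dot after `www` is escaped here, unlike in the original pattern.

```python
import re

# Close to the class in the question: letters, digits, '/', '.', '~' (no hyphen).
WITHOUT_HYPHEN = re.compile(r"(?:http://|www\.)[a-zA-Z0-9/.~]*")

# Same class with '-' appended; placed last, it is a literal hyphen, not a range.
WITH_HYPHEN = re.compile(r"(?:http://|www\.)[a-zA-Z0-9/.~-]*")

text = 'href="www.seo-sem.com"'

print(WITHOUT_HYPHEN.search(text).group(0))  # www.seo
print(WITH_HYPHEN.search(text).group(0))     # www.seo-sem.com
```

This still covers only a small subset of valid URLs, as the comments above point out, so it is a patch for the reported symptom rather than a general URL matcher.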
