Parsing an HREF from an HTML string using a regular expression

Question

I need to parse a link to a zip file out of html. The name of this zipfile changes every month. Here is a snippet of the HTML I need to parse:

<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">

The string I need to get is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" so I can download the file using WebClient. The only portion of that zip file URL that remains constant from month to month is "http://nppes.viva-it.com/". Is there a way using a regular expression to parse the full URL, "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", out of the HTML?

In the general case, using a regular expression to parse HTML won't work. However narrow you build the pattern, a perfectly legal HTML file can defeat it. Use a real parser — Michael Lorton
– Michael Lorton, Commented Apr 13, 2012 at 1:03

Oleks · Accepted Answer · 2012-04-13 09:33:42Z

1

By using HtmlAgilityPack:

var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a");
var href = anchor.GetAttributeValue("href", null);

now href variable holds "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" value.

Isn't it simplier than regex?

answered Apr 13, 2012 at 9:33

Oleks

32.4k11 gold badges80 silver badges134 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Ry- · Accepted Answer · 2012-04-12 23:54:48Z

0

If there will only ever be one ZIP linked to on the page, no problem:

Regex re = new Regex(@"http://nppes\.viva-it\.com/.+\.zip");

re.Match(html).Value // To get the matched URL

Here's a demo.

answered Apr 12, 2012 at 23:54

Ry-♦

226k56 gold badges496 silver badges504 bronze badges

Comments

score 0 · Accepted Answer · 2012-04-13 00:54:02Z

Here is a raw regex - uses branch reset.
The answer is in capture buffer 2.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?|
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s*     \g{-2} )
      | (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

Not sure if C# can do branch reset. If it can't, this variation works.
The answer is always the result of capture buffer 2 catted with capture buffer 3.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?:
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
      | (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

Collectives™ on Stack Overflow

Parsing an HREF from an HTML string using a regular expression

3 Answers 3

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related