I need to extract a link to a zip file from an HTML page. The name of the zip file changes every month. Here is a snippet of the HTML I need to parse:

<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">

The string I need is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", so that I can download the file using WebClient. The only part of the URL that stays constant from month to month is "http://nppes.viva-it.com/". Is there a way to use a regular expression to extract the full URL from the HTML?

  • In the general case, using a regular expression to parse HTML won't work. However narrowly you build the pattern, a perfectly legal HTML file can defeat it. Use a real parser. Commented Apr 13, 2012 at 1:03
  • See: stackoverflow.com/questions/56107/… Commented Apr 13, 2012 at 1:50

3 Answers

Use HtmlAgilityPack:

var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a");
var href = anchor.GetAttributeValue("href", null);

The href variable now holds the value "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip".

Isn't that simpler than a regex?
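On a real page there will usually be more than one anchor, so `//a` alone may pick the wrong link. The following is a minimal sketch, not part of the original answer, that keeps the parser approach but filters on the constant prefix and the `.zip` extension. The class name `ZipLinkFinder` and method name `FindZipUrl` are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ZipLinkFinder // hypothetical helper name
{
    // The only part of the URL that stays constant from month to month.
    private const string Prefix = "http://nppes.viva-it.com/";

    // Returns the first href that starts with the prefix and ends in .zip,
    // or null if the page contains no such link.
    public static string FindZipUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return null;

        return anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .FirstOrDefault(h =>
                h.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase) &&
                h.EndsWith(".zip", StringComparison.OrdinalIgnoreCase));
    }
}
```

The returned URL can then be handed to `new WebClient().DownloadFile(url, "nppes.zip")`, where `"nppes.zip"` is just a placeholder local file name.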


If there will only ever be one ZIP linked to on the page, no problem:

Regex re = new Regex(@"http://nppes\.viva-it\.com/.+\.zip");

string url = re.Match(html).Value; // the matched URL, or "" if nothing matched

Here's a demo.
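If the page ever contains two such links on the same line, the greedy `.+` above would run from the start of the first URL to the end of the last one. As a hedged variation (not the original answer's pattern), the character class below refuses to cross a quote, whitespace, or `>`, so each match stays inside a single attribute value. The names `ZipLinkRegex` and `FindAll` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZipLinkRegex // hypothetical helper name
{
    // [^"'\s>]+? cannot cross a quote, whitespace, or '>',
    // so a match never spans more than one attribute value.
    private static readonly Regex ZipUrl = new Regex(
        @"http://nppes\.viva-it\.com/[^""'\s>]+?\.zip",
        RegexOptions.IgnoreCase);

    // Returns every matching URL in document order (empty list if none).
    public static List<string> FindAll(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ZipUrl.Matches(html))
            urls.Add(m.Value);
        return urls;
    }
}
```

Returning a list rather than a single string makes it easy to notice when the page unexpectedly links to more than one archive.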

Here is a raw regex that uses branch reset. It is written in free-spacing mode, so whitespace inside the pattern is only layout.
The URL is in capture group 2.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?|
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s*     \g{-2} )
      | (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

I'm not sure whether C# supports branch reset. If it doesn't, this variation works.
The URL is always the contents of capture group 2 concatenated with capture group 3.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?:
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
      | (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>
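As far as I know, the .NET regex engine supports neither branch reset `(?|...)` nor the relative backreference `\g{-2}`, so neither pattern above will compile in C# as written. A rough .NET port might look like the sketch below. It is a simplification rather than an exact translation: it uses a named group `q` for the opening quote, reuses the name `url` in both branches (which .NET allows), and does not insist on matching through the closing `>`. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AnchorZipRegex // hypothetical helper name
{
    private static readonly Regex Anchor = new Regex(@"
        <a \s                                     # start of an anchor tag
        (?: [^>""'] | ""[^""]*"" | '[^']*' )*?    # skip over other attributes
        (?<=\s) href \s* = \s*
        (?:
            (?<q>[""']) \s*                       # quoted value: remember the quote
            (?<url> http://nppes\.viva-it\.com/ (?:(?!\k<q>).)+? \.zip )
            \s* \k<q>                             # same quote closes the value
          |
            (?<url> http://nppes\.viva-it\.com/ [^\s>""']* \.zip )
            (?=\s|>)                              # unquoted value ends at space or tag end
        )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Returns the URL from the first matching anchor, or null if there is none.
    public static string Find(string html)
    {
        Match m = Anchor.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

Because both branches write to the same named group, the caller reads `Groups["url"]` either way and no concatenation of capture groups is needed.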
