0

There is a lot of argument back and forth over whether, and when, it is ever appropriate to use a regex to parse HTML.

A common problem that comes up is parsing links from HTML, so my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags, and you have a pretty specific structure you are looking for.

It seems like significant overkill to use a full HTML parser. While I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect, the extra constraints of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.

Thoughts?

3 Answers

4

Consider this valid html:

<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>

What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
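
To make the comparison concrete, here is a minimal sketch in Python using the standard library's html.parser. The names HrefCollector, hrefs_via_parser, hrefs_via_regex and the NAIVE_HREF pattern are illustrative assumptions, not anything from the question; the pattern is only one plausible "simple" regex, and a more careful one would fail differently.

```python
import re
from html.parser import HTMLParser

# The test case from this answer, verbatim.
HTML = """<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
"""


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value is None for bare attributes
            self.hrefs.extend(
                value for name, value in attrs
                if name == "href" and value is not None
            )


def hrefs_via_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical naive pattern: a quoted href inside anything that starts with "<a ".
NAIVE_HREF = re.compile(r"""<a\s[^>]*href=["']([^"']*)["']""")


def hrefs_via_regex(html):
    return NAIVE_HREF.findall(html)


print(hrefs_via_parser(HTML))  # ['my">url<']
print(hrefs_via_regex(HTML))   # ['url1', 'url2', 'my']
```

The parser skips the comment and the attribute value that merely looks like a tag, and it reads the whole single-quoted value. The naive pattern reports both decoys and truncates the real URL at the first double quote.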


4 Comments

You didn't even have to get nasty there with CDATA and its ilk to present a compelling reason not to use regexes on HTML.
The HTML comment is a good example, but your wacky class is, I believe, invalid HTML.
@Endophage - If you doubt my validity claim, it's easy to check it here: validator.w3.org/#validate_by_input . Just copy and paste my example in and click the "Check" button.
@Alohci... interesting... I've had problems before with generated html that ended up having < or > in an attribute value
2

I'm one of those people who think using regex in this situation is a bad idea.

Even if you just want to match the href attribute of an <a> tag, your regex will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.

Plus, matching href attributes from tags with an XML parser is anything but overkill.

I have been parsing HTML pages every week for at least two years now. At first I used pure regex solutions, thinking that was easier and simpler than using an HTML parser.

But I had to revisit my code quite often, for many reasons:

  • the source markup had changed
  • one of the source pages had broken HTML that I hadn't tested against
  • I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work
  • ...

I found that fixing long regex patterns is not exactly fun; you have to wrap your mind around them again and again.

What I usually do now is:

  • Use tidy to clean the HTML source.
  • Use DOM + XPath to actually parse the page and extract the parts I want.
  • Use regexes only on small text-only parts (like the trimmed textContent of a node).

The code is far more robust, and I don't have to spend two hours on a long regex pattern to find out why it isn't working for 1% of the sources. It just feels proper.

Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I'm still using DOM based solutions, to keep improving my skills with DOM libraries and just produce better code.

I don't like seeing people here who just comment "Don't use regex on HTML" on every html+regex tagged question, without providing sample code or something to start with.

Here is an example that extracts href attributes from links in PHP, just to show that using an HTML parser for these common tasks isn't overkill at all.

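// $html is assumed to hold the markup to parse
$dom = new DOMDocument();
// keep loadHTML() from emitting warnings on malformed real-world markup
libxml_use_internal_errors(true);
$dom->loadHTML($html);

// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute (an empty string if the tag has none)
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}

I hope this helps.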

1 Comment

Thanks for all the info. I've tried using PHP's DOM parser (I have no option to change from PHP) and for situations where I need to parse then output it's just too damn slow... It adds somewhere in the region of 4 seconds to a page load over a regex based solution.
0

I proposed this one:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

On this thread

It can still fail, depending on what appears in name.
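
As a rough illustration of where it breaks, here is a sketch in Python. Python's re module spells named groups (?P<name>...), so the pattern is translated accordingly; extract_links and the sample inputs are assumptions made up for this example.

```python
import re

# The pattern above, with (?<name>...) rewritten as Python's (?P<name>...).
LINK = re.compile(r"""<a.*?href=["'](?P<url>.*?)["'].*?>(?P<name>.*?)</a>""")


def extract_links(html):
    """Return (url, name) pairs for every match of LINK in html."""
    return [(m.group("url"), m.group("name")) for m in LINK.finditer(html)]


# A plain link works as intended.
print(extract_links('<a href="/home">Home</a>'))
# [('/home', 'Home')]

# A quote of the other kind inside the value ends the url capture early.
print(extract_links("""<a href='my">url<'>click</a>"""))
# [('my', "url<'>click")]

# "<a" also matches the start of other tag names, such as <area>.
print(extract_links('<area href="map.html"> <a href="/x">X</a>'))
# [('map.html', ' <a href="/x">X')]
```

So the failures are not limited to name: the url group can be truncated, and a match can start in a different tag and swallow the real link.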

1 Comment

Read the question more carefully: "would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML?" I already have a regex that does it. I'm looking for whether people (who typically have a kneejerk reaction against using a regex with HTML) would consider this a legitimate use case where a regex is the appropriate solution.
