Acceptable use of Regex in HTML parsing?

Question

There is a lot of argument back and forth over when, if ever, it is appropriate to use a regex to parse HTML.

A common problem that comes up is parsing links from HTML, so my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags, and you are looking for a pretty specific structure.

It seems like significant overkill to use a full HTML parser. I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect. Still, the extra constraints of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.

Thoughts?

Alohci · Accepted Answer · 2011-03-10 01:07:48Z

4

Consider this valid HTML:

<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>

What is the list of URLs to be extracted? A parser would say there is just a single URL, with the value my">url<. Would your regular expression say the same?
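
A minimal sketch of that comparison, assuming Python's standard-library html.parser as the parser and a deliberately naive pattern as the regex. HrefCollector, NAIVE_HREF and the two helper functions are hypothetical names chosen for illustration; the question does not say which regex or parser is in use.

```python
import re
from html.parser import HTMLParser

# The test case from above, verbatim.
HTML = """<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
"""


class HrefCollector(HTMLParser):
    """Collects the href value of every real <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def hrefs_via_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical naive pattern, chosen only to illustrate the failure mode.
NAIVE_HREF = re.compile(r"""<a\s[^>]*?href=["']([^"']*)["']""")


def hrefs_via_regex(html):
    return NAIVE_HREF.findall(html)


print(hrefs_via_parser(HTML))  # only the one real link
print(hrefs_via_regex(HTML))   # also matches inside the comment and the class value
```

The parser skips the comment and treats the class value as plain attribute text, so it reports one link. The naive pattern has no notion of either context, and it also cuts the real href short at the first quote character of the other kind.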

edited Mar 10, 2011 at 1:07

answered Mar 10, 2011 at 0:51

Alohci


4 Comments

Borealid Over a year ago

You didn't even have to get nasty there with CDATA and its ilk to present a compelling reason not to use regexes on HTML.

Endophage Over a year ago

The HTML comment is a good example, but your wacky class is, I believe, invalid HTML.

Alohci Over a year ago

@Endophage - If you doubt my validity claim, it's easy to check it here: validator.w3.org/#validate_by_input . Just copy and paste my example in and click the "Check" button.

Endophage Over a year ago

@Alohci... interesting... I've had problems before with generated html that ended up having < or > in an attribute value

Yann Milin · Accepted Answer · 2011-03-10 11:59:50Z

I'm one of those people who think using regex in this situation is a bad idea.

Even if you just want to match the href attribute of an <a> tag, your regular expression will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.

Plus, matching href attributes from tags with an XML parser is anything but overkill.

I have been parsing HTML pages every week for at least 2 years now. At first I used pure regex solutions, thinking that was easier and simpler than using an HTML parser.

But I had to come back to my code quite a lot, for many reasons:

the source code had changed
one of the source pages had broken HTML that I hadn't tested against
I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work.
...

I found that fixing long regex patterns is not exactly fun; you have to wrap your mind around them again and again.

What I usually do from now on is:

Use tidy to clean the HTML source.
Use DOM + XPath to actually parse the page and extract the parts I want.
Use regexes only on small text-only parts (like the trimmed textContent of a node).
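
A rough sketch of the last two steps in Python, assuming the standard-library html.parser in place of tidy and DOM + XPath. The sample markup, the CellText class, the PRICE pattern and extract_prices are all hypothetical, made up for illustration. The parser isolates the text content of a node, and only then does a small regex run on the trimmed text.

```python
import re
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Gathers the text content of each <td>, ignoring markup nested inside it."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a <td>
        self.cells = []  # one string per outermost <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            if self.depth == 1:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.cells[-1] += data


# Hypothetical text-only pattern: a decimal amount followed by a currency code.
PRICE = re.compile(r"(\d+\.\d{2})\s*EUR")


def extract_prices(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    prices = []
    for cell in parser.cells:
        # The regex only ever sees trimmed text, never tags or attributes.
        match = PRICE.search(cell.strip())
        if match:
            prices.append(match.group(1))
    return prices


sample = '<table><tr><td class="price"> <b>Price:</b> 19.99 EUR </td></tr></table>'
print(extract_prices(sample))
```

Because the regex never touches markup, changes to attributes or nested tags in the source page do not affect the pattern.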

The code is far more robust, and I don't have to spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources. It just feels proper.

Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I still use DOM-based solutions, both to keep improving my skills with DOM libraries and simply to produce better code.

I don't like seeing people on here who just comment "Don't use regex on html" on every html+regex tagged question without providing sample code or something to start with.

Here is an example that matches href attributes from links in PHP, just to show that using an HTML parser for those common tasks isn't overkill at all.

// $html is assumed to hold the HTML source as a string
$dom = new DOMDocument();
$dom->loadHTML($html);

// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute (an empty string if the tag has none)
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}

I hope this helps somehow.

Thanks for all the info. I've tried using PHP's DOM parser (I have no option to change from PHP) and for situations where I need to parse then output it's just too damn slow... It adds somewhere in the region of 4 seconds to a page load over a regex based solution.

Community · Accepted Answer · 2017-05-23 11:47:49Z

0

I proposed this one:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

On this thread

It can possibly fail, depending on what the name group contains.
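
A small sketch of how that can go wrong, assuming Python's re module, where the named groups have to be written as (?P<url>...) and (?P<name>...). The first_link helper and the three inputs are hypothetical, picked only to show one success and two failure cases.

```python
import re

# The pattern above, with (?<name>...) rewritten as Python's (?P<name>...).
LINK = re.compile(r"""<a.*?href=["'](?P<url>.*?)["'].*?>(?P<name>.*?)</a>""")


def first_link(html):
    """Return (url, name) for the first match, or None if nothing matches."""
    m = LINK.search(html)
    return (m.group("url"), m.group("name")) if m else None


# A plain link works as intended.
print(first_link('<a href="http://example.com/">Example</a>'))

# A line break inside the link text: "." does not match a newline by default,
# so the name group can never reach </a> and the link is missed entirely.
print(first_link('<a href="/x">two\nlines</a>'))

# Mixed quote characters inside the value: the url group stops at the first
# quote of either kind, and the rest of the value leaks into the name group.
print(first_link("""<a href='my">url<'>click</a>"""))
```

The second case can be patched with the DOTALL flag, but the third needs the pattern to track which quote character opened the value, which is the kind of bookkeeping a parser already does.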

edited May 23, 2017 at 11:47

CommunityBot


answered Mar 9, 2011 at 22:51

M'vy


1 Comment

Endophage Over a year ago

Read the question more carefully: "would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML?" I already have a regex that does it. I'm asking whether people (who typically have a kneejerk reaction against using a regex with HTML) would consider this a legitimate use case where a regex is the appropriate solution.
