HTML Regexp Selector

Question

I'm a real big noobie when it comes to regexp, so please bear with me. I would like to create a regular expression which can select all HTML tags. I have the following selector...

/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/gi

... which works great for tags like this...

<p>Paragraph</p>
<span>Span</span>
<p><a href="link.php">Link</a></p>

... but it can't select tags like this:

<img src="picture.jpg" />

Could someone please direct me as to how I could fix the regular expression above so that I could select both styles of HTML tags in one clean move?

Thank you for your time,
spryno724

While a direct opposite of stackoverflow.com/questions/1732348/…, both questions have the same answer.
– BoltClock, Commented Apr 26, 2011 at 17:32
A comedy comment that does nothing to help the user is just plain mean.
– tchrist, Commented Apr 26, 2011 at 19:51
It isn't very clear what your goal is. You want to "select all HTML tags" - from where? How will you use them? If you have an HTML file, all tags are contained within the <body> and <html> tags. Also, your pattern fails when dealing with nested tags: <i><i></i></i>.
– Kobi, Commented Apr 26, 2011 at 20:03
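
To make the nesting point in the last comment concrete, here is a small sketch in Python rather than the asker's ActionScript (the port to Python's `re` module is an assumption made for illustration). The lazy `(.*?)` stops at the first `</i>` it can find, so the match ends one closing tag too early:

```python
import re

# The question's pattern, ported to Python's re module (an assumption:
# the asker was working in ActionScript, not Python).
PAIRED_TAG = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

nested = "<i><i></i></i>"
match = PAIRED_TAG.search(nested)

# The lazy group stops at the first matching closing tag, so the outer
# element is cut short and the final </i> is left over.
print(match.group(0))        # <i><i></i>
print(match.group(2))        # <i>
print(nested[match.end():])  # </i>
```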

omninonsense · Accepted Answer · 2011-04-26 17:38:52Z

1

Hmm. Okay, so you're looking for something like:

/<\/?([a-z][a-z0-9]*)[^<>]*>/
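
As a rough illustration of what this pattern selects, here is a sketch in Python (an assumption for demonstration; the thread itself is about ActionScript). The `re.IGNORECASE` flag is added here and is not part of the pattern above. Each match is one tag, whether opening, closing, or self-closing, rather than a whole element:

```python
import re

# The answer's pattern, ported to Python's re module (an assumption: the
# thread uses ActionScript). IGNORECASE is an addition for this sketch.
ANY_TAG = re.compile(r"</?([a-z][a-z0-9]*)[^<>]*>", re.IGNORECASE)

html = '<p><a href="link.php">Link</a></p><img src="picture.jpg" />'

# Collect (full tag text, tag name) for every tag in the input.
tags = [(m.group(0), m.group(1)) for m in ANY_TAG.finditer(html)]
for tag, name in tags:
    print(tag, name)
```

The self-closing `<img src="picture.jpg" />` is matched because `[^<>]*` consumes everything up to the closing `>`, including the trailing slash.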

3 Comments

Oliver Spryn Over a year ago

Hmm... close but it doesn't select the <img /> tag. :(

josh.trow Over a year ago

Uh...yes it does. What language are you using?

Oliver Spryn Over a year ago

My bad, it did work, but not quite as expected. I'm using ActionScript 3.0. I'll post the code I'm using below to help out.

Oliver Spryn · Accepted Answer · 2011-05-24 17:28:07Z

1

EDIT: I just ended up using Flash's XML capabilities to read the HTML. No need for RegExp selectors!

Here is my ActionScript:

// Match paired tags; \\1 requires the closing tag to repeat the opening tag's name.
var evaluatedInput:RegExp = new RegExp('<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>', 'gi');
var html:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";

// With the 'g' flag, each exec() call resumes from the regexp's lastIndex.
var result:Object = evaluatedInput.exec(html);

while (result != null) {
  trace(result);
  result = evaluatedInput.exec(html);
}

The content in my output window is shown below. It is exactly what I wanted: only top-level tags are selected.

<p>Hi!</p>,p,Hi!
<span>Hi!</span>,span,Hi!
<table><tbody><tr><td>Hi!</td></tr></tbody></table>,table,<tbody><tr><td>Hi!</td></tr></tbody>

Using the suggested regexp above I get:

<p>,p
</p>,p
<span>,span
</span>,span
<table>,table
<tbody>,tbody
<tr>,tr
<td>,td
</td>,td
</tr>,tr
</tbody>,tbody
</table>,table
<img src="nice.jpg" />,img

So to improve the new regexp I'd like it to:

Select only top level HTML tags, not nested ones
Return the tag and tag attributes of what it just selected
Return the contents, HTML and all, of the tag it selected

Sorry for the crash list of details. :(

edited May 24, 2011 at 17:28

answered Apr 26, 2011 at 18:37

1 Comment

omninonsense Over a year ago

I suggest looking into an XHTML parser, or something. Doing this with regexp would be possible, but really, really unpleasant.
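
For readers who want to see what the parser route might look like, here is a minimal sketch in Python using the standard library's `html.parser` (an assumption for illustration; the asker ended up using Flash's XML support instead). `TopLevelCollector`, `VOID_TAGS`, and `top_level_elements` are hypothetical names invented for this example. It aims at the three wishes listed in the answer above: only top-level elements, with their attributes and their inner HTML. It assumes reasonably well-formed input and does not check that closing tags match their opening tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TopLevelCollector(HTMLParser):
    """Hypothetical helper: records each top-level element as a dict."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per top-level element
        self._depth = 0     # number of currently open elements
        self._inner = []    # raw pieces of the current element's content

    def _record(self, tag, attrs):
        # At depth 0 this starts a new top-level element; otherwise it is
        # nested markup that belongs to the current element's content.
        if self._depth == 0:
            self.elements.append({"tag": tag, "attrs": dict(attrs), "inner": ""})
            self._inner = []
        else:
            self._inner.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._record(tag, attrs)
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img />: nothing stays open.
        self._record(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return  # ignore stray closing tags
        self._depth -= 1
        if self._depth == 0:
            self.elements[-1]["inner"] = "".join(self._inner)
        else:
            self._inner.append(f"</{tag}>")

    def handle_data(self, data):
        # Text outside any element is ignored; character references
        # arrive here already decoded (the parser's default behaviour).
        if self._depth > 0:
            self._inner.append(data)


def top_level_elements(html):
    parser = TopLevelCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


sample = ('<p>Hi!</p><span>Hi!</span>'
          '<table><tbody><tr><td>Hi!</td></tr></tbody></table>'
          '<img src="nice.jpg" />')
for element in top_level_elements(sample):
    print(element["tag"], element["attrs"], repr(element["inner"]))
```

Tracking depth with a counter is what lets nested elements of the same name stay inside their parent, which is the case the backreference-based regexp cannot handle.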
