2

I'm a real newbie when it comes to regexp, so please bear with me. I would like to create a regular expression that can select all HTML tags. I have the following pattern...

/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/gi

... which works great for tags like this...

<p>Paragraph</p>
<span>Span</span>
<p><a href="link.php">Link</a></p>

... but it can't select tags like this:

<img src="picture.jpg" />

Could someone please direct me as to how I could fix the regular expression above so that I could select both styles of HTML tags in one clean move?

Thank you for your time,
spryno724

4 Comments
  • 2
    While a direct opposite of stackoverflow.com/questions/1732348/…, both questions have the same answer. Commented Apr 26, 2011 at 17:32
  • 1
    Oh, Bolt, I love that post. LOL Commented Apr 26, 2011 at 17:42
  • 2
    A comedy comment that does nothing to help the user is just plain mean. Commented Apr 26, 2011 at 19:51
  • It isn't very clear what your goal is. You want to "select all HTML tags", but from where? How will you use them? If you have an HTML file, all tags are contained within the <body> and <html> tags. Also, your pattern fails when dealing with nested tags of the same name: <i><i></i></i>. Commented Apr 26, 2011 at 20:03

2 Answers

1

Hmm. Okay, so you're looking for something like:

/<\/?([a-z][a-z0-9]*)[^<>]*>/
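To see what this pattern picks up, here is a minimal sketch. It is written in TypeScript rather than ActionScript so it can be run outside Flash, on the assumption that the two RegExp dialects agree for this pattern. The `g` and `i` flags and the helper name `listTags` are my additions for illustration, not part of the pattern above.

```typescript
// Pattern from the answer, written as a literal with the slash escaped.
// The "g" and "i" flags are assumptions added so that every tag is found,
// regardless of letter case.
const tagPattern: RegExp = /<\/?([a-z][a-z0-9]*)[^<>]*>/gi;

// Hypothetical helper: returns every tag (opening, closing, or self-closing)
// in the order it appears in the input.
function listTags(html: string): string[] {
  const tags: string[] = [];
  tagPattern.lastIndex = 0; // a global pattern keeps state between calls
  let match: RegExpExecArray | null = tagPattern.exec(html);
  while (match !== null) {
    tags.push(match[0]);
    match = tagPattern.exec(html);
  }
  return tags;
}

const sample = '<p><a href="link.php">Link</a></p><img src="picture.jpg" />';
console.log(listTags(sample));
```

Each tag is matched on its own, so `<img src="picture.jpg" />` is found, but opening and closing tags come back as separate matches rather than as one element.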

3 Comments

Hmm... close but it doesn't select the <img /> tag. :(
Uh...yes it does. What language are you using?
My bad, it did work, but not quite as expected. I'm using ActionScript 3.0. I'll post the code I'm using below to help out.
1

EDIT: I just ended up using Flash's XML capabilities to read the HTML. No need for RegExp selectors!

Here is my ActionScript

// Keep the input in one variable so every exec() call searches the same string.
var input:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";
// The backslashes are doubled because the pattern is built from a string.
var evaluatedInput:RegExp = new RegExp('<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>', 'gi');
var result:Object = evaluatedInput.exec(input);

// With the "g" flag, each exec() call resumes after the previous match and returns null when none remain.
while (result != null) {
  trace(result);
  result = evaluatedInput.exec(input);
}

The output window shows the following, which is exactly what I wanted, since only top-level tags are selected:

<p>Hi!</p>,p,Hi!
<span>Hi!</span>,span,Hi!
<table><tbody><tr><td>Hi!</td></tr></tbody></table>,table,<tbody><tr><td>Hi!</td></tr></tbody>

Using the suggested regexp above I get:

<p>,p
</p>,p
<span>,span
</span>,span
<table>,table
<tbody>,tbody
<tr>,tr
<td>,td
</td>,td
</tr>,tr
</tbody>,tbody
</table>,table
<img src="nice.jpg" />,img

So to improve the new regexp I'd like it to:

  • Select only top level HTML tags, not nested ones
  • Return the tag and tag attributes of what it just selected
  • Return the contents, HTML and all, of the tag it selected

Sorry for the crash list of details. :(
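For what it's worth, the three requirements above can be sketched without one giant pattern: match each tag individually and count nesting depth. The code below is a TypeScript sketch rather than ActionScript, so it can be run outside Flash. The names `TopLevelElement` and `topLevelElements` are hypothetical. It assumes XHTML-style input where every element is either closed or self-closed, so a bare `<br>` would throw the depth count off.

```typescript
// Hypothetical result shape: one entry per top-level element.
interface TopLevelElement {
  name: string;       // tag name, e.g. "table"
  attributes: string; // raw attribute text from the opening tag
  content: string;    // inner HTML; empty for self-closing tags
}

function topLevelElements(html: string): TopLevelElement[] {
  // Group 1: "/" for a closing tag; group 2: tag name;
  // group 3: attribute text; group 4: "/" for a self-closing tag.
  const tag = /<(\/?)([a-z][a-z0-9]*)\b([^<>]*?)(\/?)>/gi;
  const found: TopLevelElement[] = [];
  let depth = 0;
  let current: TopLevelElement | null = null;
  let contentStart = 0;

  let m: RegExpExecArray | null;
  while ((m = tag.exec(html)) !== null) {
    const isClosing = m[1] === "/";
    const isSelfClosing = m[4] === "/";
    if (isClosing) {
      if (depth > 0) depth--; // ignore stray closing tags
      if (depth === 0 && current !== null) {
        // Back at the top level: everything since the opening tag is content.
        current.content = html.slice(contentStart, m.index);
        found.push(current);
        current = null;
      }
    } else {
      if (depth === 0) {
        current = { name: m[2], attributes: m[3].trim(), content: "" };
        contentStart = m.index + m[0].length;
        if (isSelfClosing) {
          found.push(current);
          current = null;
        }
      }
      if (!isSelfClosing) depth++;
    }
  }
  return found;
}

const html = '<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src="nice.jpg" />';
console.log(topLevelElements(html));
```

For the sample input this yields four entries (p, span, table, img), with the table's content being the full `<tbody>…</tbody>` markup. Because it counts depth instead of relying on a backreference, nested tags of the same name such as `<i><i></i></i>` come back as a single top-level element. It is still not a real parser: comments, scripts, and attribute values containing `<` or `>` would confuse it.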

1 Comment

I suggest looking into an XHTML parser, or something. Doing this with regexp would be possible, but really, really unpleasant.
