6

I'm trying to write a regex that will find a string of HTML tags inside a code editor (Khan Live Editor) and give the following error:

"You can't put <h1.. 2.. 3..> inside <p> elements."

This is the string I'm trying to match:

<p> ... <h1>

This is the string I don't want to match:

<p> ... </p><h1>

Instead the expected behavior is that another error message appears in this situation.

So in English, I want a string that:
- starts with <p> and
- ends with <h1> but
- does not contain </p>.

It's easy enough to make this work if I don't care about the existence of a </p>. My expression looks like this, /<p>.*<h[1-6]>/ and it works fine. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]>).
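To make the problem concrete, here is a minimal sketch showing that the simple pattern accepts both strings. The variable names are made up for illustration and are not part of the Live Editor's code.

```javascript
// Naive pattern from the question: <p>, anything, then a heading tag.
const naive = /<p>.*<h[1-6]>/;

const shouldMatch = "<p> ... <h1>";
const shouldNotMatch = "<p> ... </p><h1>";

// Both tests succeed, because .* runs straight across the </p>.
const naiveResults = [naive.test(shouldMatch), naive.test(shouldNotMatch)];
console.log(naiveResults); // both entries are true
```

The second result is the false positive that needs to be excluded.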


I've tried a lot of different expressions from some other posts on here:

Regular expression to match a line that doesn't contain a word?

From which I tried: <p>^((?!<\/p>).)*$</h1>

regex string does not contain substring

From which I tried: /^<p>(?!<\/p>)<h1>$/

Regular expression that doesn't contain certain string

This link suggested: aa([^a] | a[^a])aa

Which doesn't work in my case because I need the specific string "</p>" not just the characters of it since there might be other tags between <p> ... <h1>.


I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?

Thanks in advance for any help.

Edit:

To answer why I need this done:

The problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.

13
  • Exactly. So the problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception. Commented Nov 24, 2015 at 18:53
  • This has nothing to do with your regex question, but it is actually correct and fine to have HTML content that contains an <h1>, <p>, etc. before an explicit </p>, because in HTML5 (which has this flow-content rule) the </p> is completely optional. For instance, <p>Paragraph 1.<p>Paragraph 2.<h2>Heading</h2><p>Paragraph 3. is completely valid HTML5 and can be authored as such intentionally. Commented Nov 24, 2015 at 18:57
  • Should we assume you don't ever have attributes or whitespace in the tags? Commented Nov 24, 2015 at 19:03
  • @AlanMcBee Yes that's true. Commented Nov 24, 2015 at 19:05
  • 1
    @DanFletcher You said that RegEx is your only option. However, you can cheat your validator and pass a RegEx from an IIFE in argument list, and utilize Niet the Dark Absol's code. Please check a fiddle. Commented Nov 24, 2015 at 19:45

5 Answers

6

Sometimes it's better to break a problem down.

var str = "YOUR INPUT HERE";
str = str.substr(str.indexOf("<p>"));
str = str.substr(0,str.lastIndexOf("<h1>"));
if( str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
}
else {
    // there isn't
}

This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
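As a rough sketch of how the same idea could cope with a missing <p> or <h1>, the check can be wrapped in a function that walks over every <p> in the input. The function name hasHeadingInOpenParagraph is hypothetical, and like the snippet above it only looks for a literal <h1> with no attributes.

```javascript
// Hypothetical helper: returns true when some <p> is followed by an <h1>
// before any </p> appears, and false otherwise (including when either
// tag is missing altogether).
function hasHeadingInOpenParagraph(input) {
  let start = input.indexOf("<p>");
  while (start !== -1) {
    // First <h1> after this <p>; if there is none, later <p> tags have none either.
    const heading = input.indexOf("<h1>", start);
    if (heading === -1) return false;

    // First </p> after this <p>.
    const close = input.indexOf("</p>", start);
    if (close === -1 || close > heading) return true;

    // This paragraph was closed before the heading; try the next <p>.
    start = input.indexOf("<p>", start + 3);
  }
  return false;
}

console.log(hasHeadingInOpenParagraph("<p> ... <h1>"));     // true
console.log(hasHeadingInOpenParagraph("<p> ... </p><h1>")); // false
console.log(hasHeadingInOpenParagraph("<h1>no paragraph"));  // false
```

Using indexOf with a start position avoids the substr calls and makes the "no <p> at all" case an explicit early return.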


3 Comments

If it can be done without regex (Without adding too much complexity), then it should be done. +1
Thank you. In this situation - at least for now - regex is my only option.
This is actually a viable solution for my problem. As @Teemu pointed out to me, I could pass my validator an IIFE and this would work. Thanks again!
3

Search for <p> followed by any number of characters, none of which begins a </p>, eventually followed by <h[1-6]>. The negative lookahead (?!<\/p>) is checked before each character is consumed, so the match can never step onto a closing </p>. [^] means any character that is not nothing, which in JavaScript also captures newlines.

/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi


const strings = [ '<p> ... <h1>', '<p> ... </p><h1>', '<P> Hello <h1>', '<p></p><h1>',
                  '<p><h1>' ];

const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;

const test = input => ({ input, test: regex.test(input), matches: input.match(regex) });

for(let input of strings) console.log(JSON.stringify(test(input)));

// { "input": "<p> ... <h1>",     "test": true,  "matches": ["<p> ... <h1>"]   }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null               }
// { "input": "<P> Hello <h1>",   "test": true,  "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>",      "test": false, "matches": null               }
// { "input": "<p><h1>",          "test": true,  "matches": ["<p><h1>"]        }


2

Your first regular expression was close, but you need to remove the ^ and $ anchors. If you need to match across newlines, use [\s\S] instead of the dot (.).

Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
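A quick way to check the multi-line behaviour is sketched below. The sample strings and variable names are made up for illustration.

```javascript
// The final regex from above, written as a JavaScript literal.
const headingInParagraph = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/;

// A paragraph left open across several lines, then a heading.
const multiLine = "<p>\n  some text\n<h2>Title</h2>";
// The same content, but with the paragraph closed before the heading.
const closedFirst = "<p>\n  some text\n</p>\n<h2>Title</h2>";

console.log(headingInParagraph.test(multiLine));   // true
console.log(headingInParagraph.test(closedFirst)); // false
```

Because [\s\S] matches newlines, the first string is caught even though the tags sit on different lines, while the lookahead stops the second one at its </p>.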

However, a heading tag (<h1> - <h6>) directly after an unclosed <p> is perfectly legal HTML. The heading is not nested in the paragraph; the two are simply treated as sibling elements, with the paragraph element ending where the heading element begins.

A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is not an a element.

http://www.w3.org/TR/html-markup/p.html

1 Comment

Wow! Thank you so much! This works better than I need it to :) The reason we catch <p><h1></h1></p> btw is because it shouldn't pass a validation and we are trying to teach good practices. Thanks again.
1

I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.

Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.

Use a simple regular expression to find all tags in the HTML, and put them into a list in the same order in which they were found. Ignore the text nodes between the tags.

Then, walk through the list in order, keeping a running tally of the tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Likewise, increment the H counter when you get to an <h1> (etc.) tag, and decrement it on the closing tag.

If the H counter is > 0 while the P counter is > 0, that's your error.
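The counting idea could be sketched roughly as follows. The function name findHeadingInsideParagraph is hypothetical, and two details are assumptions of this sketch, not part of the description above: only p and h1-h6 tags are collected, and the counters are clamped at zero so a stray closing tag cannot hide a later error.

```javascript
// Hypothetical helper: walks the p/h1-h6 tags in document order and reports
// whether a heading is opened while a paragraph is still open.
function findHeadingInsideParagraph(html) {
  // Matches <p>, </p>, <h1>..<h6>, </h1>..</h6>, with optional attributes.
  // The \b keeps tags such as <pre> from being mistaken for <p>.
  const tagPattern = /<(\/?)(p|h[1-6])\b[^>]*>/gi;
  let pCount = 0;
  let hCount = 0;
  let tag;
  while ((tag = tagPattern.exec(html)) !== null) {
    const delta = tag[1] === "/" ? -1 : 1;
    if (tag[2].toLowerCase() === "p") {
      pCount = Math.max(0, pCount + delta); // assumption: clamp at zero
    } else {
      hCount = Math.max(0, hCount + delta); // assumption: clamp at zero
    }
    if (pCount > 0 && hCount > 0) return true; // heading inside an open <p>
  }
  return false;
}

console.log(findHeadingInsideParagraph("<p> ... <h1>"));      // true
console.log(findHeadingInsideParagraph("<p> ... </p><h1>"));  // false
console.log(findHeadingInsideParagraph("<p><h1></h1></p>"));  // true
```

Text between the tags is ignored entirely, so this remains a tally and not a real parser.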

1 Comment

Thank you so much for taking the time to do that. This would definitely work!
-2

I know I'm not formatting it correctly, but I think the logic will work

(just replace the AND and NOT with the correct symbols):

/(<p>.*<h[1-6]>)AND !(<p>.*</p><h[1-6]>)/

Let me know how it goes :)
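One literal reading of that pseudo-expression, as a sketch only, is to run two separate tests in JavaScript and combine them with && and !. The function and variable names here are hypothetical. Note that this is two regexes plus code, not a single regex, and the second pattern only notices a </p> that sits immediately before the heading tag.

```javascript
// First half of the pseudo-expression: <p> ... <h#>
const hasHeadingAfterP = /<p>.*<h[1-6]>/;
// Second half: <p> ... </p><h#>, with the </p> directly before the heading.
const hasClosedPBeforeHeading = /<p>.*<\/p><h[1-6]>/;

// Hypothetical helper combining the two tests with AND NOT.
function literalAndNot(input) {
  return hasHeadingAfterP.test(input) && !hasClosedPBeforeHeading.test(input);
}

console.log(literalAndNot("<p> ... <h1>"));     // true
console.log(literalAndNot("<p> ... </p><h1>")); // false
console.log(literalAndNot("<p>a</p> <h1>"));    // true, the </p> goes unnoticed
```

The last line shows where this reading falls short: a single space between </p> and <h1> is enough to slip past the second pattern.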

1 Comment

Thanks, but if it was that simple I would have done that. It's not understanding the logic of it that I'm struggling with, it's implementing that logic into a regex.
