JavaScript Regex: Finding a String that does not contain </p>

Question

I'm trying to write a regex that will find a string of HTML tags inside a code editor (Khan Live Editor) and give the following error:

"You can't put <h1.. 2.. 3..> inside <p> elements."

This is the string I'm trying to match:

<p> ... <h1>

This is the string I don't want to match:

<p> ... </p><h1>

Instead, a different error message should appear in that situation.

So, in plain English, I want a string that:
- starts with <p> and
- ends with <h1> but
- does not contain </p>.

It's easy enough to make this work if I don't care about the existence of a </p>. My expression looks like this: /<p>.*<h[1-6]>/, and it works fine for that. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]>).
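To make the problem concrete, here is a minimal sketch of that behaviour, using the two sample strings from above. The variable names are only illustrative.

```javascript
// The naive pattern: <p>, then anything, then a heading tag.
const naive = /<p>.*<h[1-6]>/;

const wanted = '<p> ... <h1>';        // should match
const unwanted = '<p> ... </p><h1>';  // should NOT match

console.log(naive.test(wanted));    // true
console.log(naive.test(unwanted));  // true, because .* also consumes "</p>"
```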

I've tried a lot of different expressions from some other posts on here:

Regular expression to match a line that doesn't contain a word?

From which I tried: <p>^((?!<\/p>).)*$</h1>

regex string does not contain substring

From which I tried: /^<p>(?!<\/p>)<h1>$/

Regular expression that doesn't contain certain string

This link suggested: aa([^a] | a[^a])aa

Which doesn't work in my case because I need the specific string "</p>" not just the characters of it since there might be other tags between <p> ... <h1>.

I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?

Thanks in advance for any help.

Edit:

To answer why I need this done:

The problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.

Exactly. So the problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.
– Dan Fletcher, Commented Nov 24, 2015 at 18:53
This has nothing to do with your regex question, but it is actually correct and fine to have HTML content that contains an <h1>, <p>, etc. before an explicit </p>, since in HTML5 (which has this flow-content rule) the </p> is completely optional. For instance, <p>Paragraph 1.<p>Paragraph 2.<h2>Heading</h2><p>Paragraph 3. is completely valid HTML5 and can be authored as such intentionally.
– rgthree, Commented Nov 24, 2015 at 18:57
Should we assume you don't ever have attributes or whitespace in the tags?
– Alan McBee, Commented Nov 24, 2015 at 19:03
@DanFletcher You said that RegEx is your only option. However, you can cheat your validator by passing a RegEx from an IIFE in the argument list, and use Niet the Dark Absol's code. Please check a fiddle.
– Teemu, Commented Nov 24, 2015 at 19:45

Niet the Dark Absol · Accepted Answer · 2015-11-24 18:47:02Z

6

Sometimes it's better to break a problem down.

var str = "YOUR INPUT HERE";
str = str.substr(str.indexOf("<p>"));
str = str.substr(0,str.lastIndexOf("<h1>"));
if( str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
}
else {
    // there isn't
}

This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
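As a rough sketch of how the missing-<p> case might be guarded, the same steps can be wrapped in a function. The function name is hypothetical and not part of the original answer; it keeps the same simplifications (literal "<p>" and "<h1>" only, first <p> to last <h1>).

```javascript
// Hypothetical helper: true when a <p> ... <h1> span exists with no </p> in between.
function headingInsideOpenParagraph(input) {
  const start = input.indexOf("<p>");
  const end = input.lastIndexOf("<h1>");

  // Guard: no <p>, no <h1>, or the last <h1> comes before the first <p>.
  if (start === -1 || end === -1 || end < start) return false;

  // Same check as above: is there a </p> between the first <p> and the last <h1>?
  return !input.slice(start, end).includes("</p>");
}

console.log(headingInsideOpenParagraph("<p> ... <h1>"));      // true
console.log(headingInsideOpenParagraph("<p> ... </p><h1>"));  // false
console.log(headingInsideOpenParagraph("<h1>"));              // false
```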


3 Comments

Blue Over a year ago

If it can be done without regex (Without adding too much complexity), then it should be done. +1

Dan Fletcher Over a year ago

Thank you. In this situation - at least for now - regex is my only option.

Dan Fletcher Over a year ago

This is actually a viable solution for my problem. As @Teemu pointed out to me, I could pass my validator an IIFE and this would work. Thanks again!

score 3 · Accepted Answer · 2018-10-26 20:49:26Z

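Search for <p> followed by any number of characters, none of which begins a </p>, eventually followed by <h[1-6]>. ([^] matches any character at all, including newlines, so the match can span multiple lines.) The negative lookahead has to come before the character it guards; otherwise a </p> placed directly after <p> slips through.

/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi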

RegEx101 Test Case

const strings = [ '<p> ... <h1>', '<p> ... </p><h1>', '<P> Hello <h1>', '<p></p><h1>',
                  '<p><h1>' ];

const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;

const test = input => ({ input, test: regex.test(input), matches: input.match(regex) });

for(let input of strings) console.log(JSON.stringify(test(input)));

// { "input": "<p> ... <h1>",     "test": true,  "matches": ["<p> ... <h1>"]   }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null               }
// { "input": "<P> Hello <h1>",   "test": true,  "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>",      "test": false, "matches": null               }
// { "input": "<p><h1>",          "test": true,  "matches": ["<p><h1>"]        }


Pluto · Accepted Answer · 2015-11-24 19:41:23Z

2

Your first regular expression was close; you just need to remove the ^ and $ characters. If you need to match across newlines, use [\s\S] instead of the dot (.).

Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
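A quick way to try this pattern out is sketched below. The sample strings are illustrative; the last two are assumptions added here to show the multi-line case and a closed paragraph followed by an open one.

```javascript
// Tempered pattern: each character is consumed only if it does not start "</p>".
const pattern = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/;

const samples = [
  '<p> ... <h1>',               // open paragraph before the heading
  '<p> ... </p><h1>',           // paragraph closed before the heading
  '<p>line one\nline two<h2>',  // open paragraph spanning a newline
  '<p>a</p>\n<p>b<h3>',         // first paragraph closed, second one still open
];

const results = samples.map((s) => pattern.test(s));
console.log(results); // [ true, false, true, true ]
```

No g flag is used here, so repeated calls to test() do not depend on lastIndex.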

However, a heading tag (<h1> to <h6>) appearing after an unclosed <p> is perfectly legal HTML. The heading is not nested inside the paragraph; the two are sibling elements, with the paragraph element ending where the heading element begins.

A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is not an a element.

http://www.w3.org/TR/html-markup/p.html


1 Comment

Dan Fletcher Over a year ago

Wow! Thank you so much! This works better than I need it to :) The reason we catch <p><h1></h1></p> btw is because it shouldn't pass a validation and we are trying to teach good practices. Thanks again.

Alan McBee · Accepted Answer · 2015-11-24 19:46:42Z

1

I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.

Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.

Use a simple regular expression to simply find all tags in the HTML, and put them into a list in the same order in which they were found. Ignore the text nodes between the tags.

Then, walk through the list in order, keeping a running tally of the tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Likewise, increment the H counter when you get to a <h1> (etc.) tag, and decrement it on the closing tag.

If the H counter is > 0 while the P counter is > 0, that's your error.
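The counting approach might look roughly like the sketch below. The function name and the tag pattern are assumptions made for illustration; the pattern tolerates attributes but does nothing about comments, scripts, or malformed tags.

```javascript
// Hypothetical helper: true as soon as a heading is open while a <p> is open.
function findHeadingInsideParagraph(html) {
  // Find opening and closing <p> and <h1>..<h6> tags; all other tags and text are ignored.
  const tagPattern = /<(\/?)(p|h[1-6])\b[^>]*>/gi;
  let openParagraphs = 0;
  let openHeadings = 0;

  for (const [, slash, name] of html.matchAll(tagPattern)) {
    const delta = slash === "/" ? -1 : 1;
    if (name.toLowerCase() === "p") {
      openParagraphs += delta;
    } else {
      openHeadings += delta;
    }
    // The error condition: both counters are positive at the same time.
    if (openHeadings > 0 && openParagraphs > 0) return true;
  }
  return false;
}

console.log(findHeadingInsideParagraph("<p><h1></h1></p>"));          // true
console.log(findHeadingInsideParagraph("<p>text</p><h1>title</h1>")); // false
```

Note that a stray closing tag pushes a counter below zero in this sketch, so input such as "</p><p><h1>" would go undetected unless the counters are clamped at zero.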


1 Comment

Dan Fletcher Over a year ago

Thank you so much for taking the time to do that. This would definitely work!

Chris Conaty · Accepted Answer · 2015-11-24 18:46:34Z

-2

I know I'm not formatting it correctly, but I think the logic will work

(just replace the AND and NOT with the correct symbols):

/(<p>.*<h[1-6]>)AND !(<p>.*</p><h[1-6]>)/

Let me know how it goes :)
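Read literally, the AND and NOT cannot be written inside one regex literal, but the idea can be approximated with two separate tests joined in JavaScript. This is only a sketch with illustrative names, and it is not a single regex, which is what the question asks for.

```javascript
// Two plain patterns, combined in code instead of inside one expression.
const paragraphThenHeading = /<p>.*<h[1-6]>/;
const closedParagraphThenHeading = /<p>.*<\/p><h[1-6]>/;

// Hypothetical helper: first pattern matches AND second pattern does NOT.
function matchesAndNot(input) {
  return paragraphThenHeading.test(input) &&
         !closedParagraphThenHeading.test(input);
}

console.log(matchesAndNot("<p> ... <h1>"));      // true
console.log(matchesAndNot("<p> ... </p><h1>"));  // false
```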


1 Comment

Dan Fletcher Over a year ago

Thanks, but if it was that simple I would have done that. It's not understanding the logic of it that I'm struggling with, it's implementing that logic into a regex.
