Regex select all text between tags

Question

What is the best way to select all the text between two tags, e.g. the text between all the '<pre>' tags on the page?

The best way is to use an HTML parser like "Beautiful Soup" if you're into Python...
– Fredrik Pihl, Commented Aug 23, 2011 at 20:45
In general, using regular expressions to parse HTML is not a good idea: stackoverflow.com/questions/1732348/…
– murgatroid99, Commented Aug 23, 2011 at 20:46
Do not parse text between tags with regex, because arbitrarily nested tags make HTML non-regular. Matching tags seems to be okay. /<div>.*?<\/div>/.exec("<div><div></div></div>")
– jdh8, Commented Aug 19, 2017 at 17:46

PyKing · Accepted Answer · 2011-08-23 21:00:44Z

216

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first capture group (for more specific instructions, specify a language), but this assumes that your HTML is very simple and valid.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
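
To make "extract the first group" concrete, here is a minimal sketch in Python. The choice of Python and the sample string are assumptions for illustration; the question does not name a language.

```python
import re

# Hypothetical sample input with two <pre> elements.
html = "<p>intro</p><pre>first</pre><p>middle</p><pre>second</pre>"

# With exactly one capture group, findall returns the contents of that
# group for every non-overlapping match, so the tags are left out.
pre_texts = re.findall(r"<pre>(.*?)</pre>", html)
print(pre_texts)  # ['first', 'second']
```

The full match (group 0) still includes the tags; only group 1 holds the text between them.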

answered Aug 23, 2011 at 21:00

PyKing


8 Comments

capikaw Over a year ago

This does not select the text between the tags, it includes the tags.

Vishal Kumar Sahu Over a year ago

You need to grab the selection using ()

Felipe Augusto Over a year ago

For multi line tags: <html_tag>(.+)((\s)+(.+))+<\/html_tag>

rbsdca Over a year ago

This still has visibility, so: if you're still seeing <pre> tags after trying <pre>(.*?)<\/pre>, it's because you're looking at the full match instead of the (.*?) capture group. It sounds cheesy, but I always think "parenthesis = pair of thieves", because unless the ( is followed by a ? as in (?: or (?>, every match will have two captures: one for the full match and one for the capture group. Each additional set of parentheses adds an additional capture. You just have to know how to retrieve both captures in whatever language you're working with.

phil123456 Over a year ago

you need to escape /


Vikas · Accepted Answer · 2015-03-17 11:21:27Z

202

The closing tag can be on a different line from the opening tag, which is why \n needs to be added:

<PRE>(.|\n)*?<\/PRE>
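
A small sketch of the multi-line case in Python (the language and the sample string are assumptions). It also shows one detail of the pattern above: a repeated group such as (.|\n)*? keeps only its last repetition, so an outer group is needed to capture the whole content.

```python
import re

html = "<PRE>line one\nline two</PRE>"

# "." does not match "\n" by default, so the single-line pattern finds nothing.
no_flag = re.search(r"<PRE>(.*?)</PRE>", html)

# The pattern as written: group 1 is repeated, so it ends up holding
# only the last character it matched.
last_only = re.search(r"<PRE>(.|\n)*?</PRE>", html)

# Same alternation, wrapped in an outer group to capture everything.
alternation = re.search(r"<PRE>((?:.|\n)*?)</PRE>", html)

# Two common alternatives: a class that matches any character,
# or "." combined with the DOTALL flag.
char_class = re.search(r"<PRE>([\s\S]*?)</PRE>", html)
dotall = re.search(r"<PRE>(.*?)</PRE>", html, re.DOTALL)

print(no_flag)               # None
print(last_only.group(1))    # o
print(alternation.group(1) == char_class.group(1) == dotall.group(1))  # True
```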

edited Mar 17, 2015 at 11:21

Vikas

answered Jun 2, 2013 at 7:57

Zac Dreyer

5 Comments

Caleuanhopkins Over a year ago

Important point about adding (.|\n)*? when dealing with HTML tags across multiple lines. The selected answer works only if the HTML tags are on the same line.

Mark Over a year ago

<PRE>(.|\n|\r\n)*?<\/PRE> for Windows line endings

Wiktor Stribiżew Over a year ago

Never use (.|\n)*? to match any char. Always use . with s (singleline) modifier. Or a [\s\S]*? workaround.

wkille Over a year ago

I wanted to select code comments in notepad++, so using this answer I came up with /\*(.|\n)*?\*/ which did the job -- thank you

Eamonn Kenny Over a year ago

can somebody clarify how you use this with sed and how do you gather the output with \1? I've tried and it fails for me with sed -e.

Community · Accepted Answer · 2020-02-25 13:12:05Z

56

To exclude the delimiting tags:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) is a lookbehind: the match must be preceded by <pre>.

(?=</pre>) is a lookahead: the match must be followed by </pre>.

The result is the text inside the pre tag, without the tags themselves.
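
A minimal sketch of the lookaround version in Python (language and sample input are assumptions). Python's re module only accepts fixed-width lookbehind, which a literal <pre> satisfies.

```python
import re

# Hypothetical input with two adjacent elements.
html = "<pre>first</pre><pre>second</pre>"

# The lookarounds assert the tags without consuming them, so the full
# match itself is already the text between the tags.
inner = re.findall(r"(?<=<pre>).*?(?=</pre>)", html)
print(inner)  # ['first', 'second']
```

As with the other single-line patterns, "." does not cross newlines here unless re.DOTALL is passed.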

edited Feb 25, 2020 at 13:12

CommunityBot

answered Jul 4, 2018 at 19:31

Jean-Simon Collard

4 Comments

KingKongCoder Over a year ago

People using this should look at @krishna thakor's answer, which also handles content that has newlines between the tags.

Pking Over a year ago

This helped in my case (not needing to consider newlines). Thanks.

Evan Kleiner Over a year ago

This doesn't work if you have multiple elements. E.g., <pre>first</pre><pre>second</pre>

Amine KOUIS Over a year ago

@EvanKleiner I checked it here and it's working fine.

DevWL · Accepted Answer · 2021-09-06 14:22:52Z

43

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is the expression applied to the content. Here it matches a letter, a digit, a newline, one of the special characters listed in the square brackets, or a space. The pipe character | simply means "OR".

+? The plus sign matches one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy' (lazy).

(?=(</pre>)) The selection has to be followed by the </pre> tag.

Depending on your use case you might need to add some modifiers, such as i or m:

i - case-insensitive
m - multi-line search

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.

Lookbehind in JavaScript

The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript engines that predate ES2018, however, do not support lookbehind, so there we have to forget about using (?<=(<pre>)) and look for some kind of workaround. Perhaps simply strip the opening tag from the start of each result, as in https://stackoverflow.com/questions/11592033/regex-match-text-between-tags

Also look at the JavaScript regex documentation for non-capturing parentheses.

edited Sep 6, 2021 at 14:22

answered Dec 1, 2016 at 10:20

DevWL

5 Comments

David Zwart Over a year ago

Note that you need to escape the single/double quote characters with ` in order to put the regexp in a string.

bkqc Over a year ago

Wouldn't this break if you had <pre>Foo<pre>Bar</pre>Zed</pre> and return only <pre>Foo<pre>Bar</pre>?

DevWL Over a year ago

@bkqc why would you nest <pre> tags? The expression above would return only the inner tag. If you want to return outer tag, you would use .* instead of +?. Like so: (?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`]| ).*(?=(</pre>))

DevWL Over a year ago

And if you need to get them all, you would programmatically run the same outer expression recursively until no match will be found.

bkqc Over a year ago

@DevWL The only reason to nest pre tags is because it is what is used for the example, but the OP is initially asking for generic tags. Also, with your second version, though it matches the specific case mentioned, it will "fail" on <pre>Foo<pre>Bar</pre>Zed</pre>dsada<pre>test</pre> with this result: Foo<pre>Bar</pre>Zed</pre>dsada<pre>test
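
The nesting problem discussed in these comments can be reproduced with a short sketch. It uses Python and the plain <pre>(.*?)</pre> form rather than the long character-class pattern (both are assumptions made to keep the example small); the lazy and greedy behaviours are the same.

```python
import re

# The test string from the comment above.
nested = "<pre>Foo<pre>Bar</pre>Zed</pre>dsada<pre>test</pre>"

# Lazy: stops at the first </pre>, so the outer element is cut short.
lazy = re.findall(r"<pre>(.*?)</pre>", nested)

# Greedy: runs to the last </pre>, swallowing unrelated elements.
greedy = re.findall(r"<pre>(.*)</pre>", nested)

print(lazy)    # ['Foo<pre>Bar', 'test']
print(greedy)  # ['Foo<pre>Bar</pre>Zed</pre>dsada<pre>test']
```

Neither quantifier pairs each opening tag with its own closing tag, which is the usual argument for a parser once nesting is possible.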

Ryan M · Accepted Answer · 2020-11-12 01:27:47Z

28

This answer assumes lookaround support! It allowed me to identify all the text between pairs of opening and closing tags, that is, all the text between a '>' and the following '</'. It works because lookaround doesn't consume the characters it matches.

(?<=>)([\w\s]+)(?=<\/)

I tested it in https://regex101.com/ using this HTML fragment.

<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>

It's a game of three parts: the look behind, the content, and the look ahead.

(?<=>)    # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/)   # look ahead  (but don't consume/capture) for a '</'
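
For readers who want to try this outside regex101, here is a sketch in Python (an assumption; the answer itself is engine-neutral). Whitespace that sits directly between a '>' and a '</' can also satisfy [\w\s]+, so the sketch filters out whitespace-only matches.

```python
import re

# The HTML fragment from the answer.
fragment = """<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>"""

matches = re.findall(r"(?<=>)([\w\s]+)(?=</)", fragment)

# Keep only matches that contain something other than whitespace.
cells = [m for m in matches if m.strip()]
print(cells)
```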

I hope that serves as a starter for 10. Good luck.

edited Nov 12, 2020 at 1:27

Ryan M♦

answered Dec 22, 2019 at 17:18

Clarius

5 Comments

Sean Feldman Over a year ago

Thank you. Not only this is a better answer, but also a great link to the regex101 site. Upvoted! 🙂

Raphael Setin Over a year ago

The above regex is excellent as is, but it will only return the first match found and won't cover special chars nor new lines. For that, use the following instead: myString.match(/(?<=>)([\w\s\-\!@#$%^&*()_+|~={}\[\]:";'?,.\/]+)(?=<\/)/gm); This will return an array with all the matches, including almost all special characters available.

Vijayakumar Over a year ago

@RaphaelSetin Instead of having a big regex combination for words and special characters, we can use (?<=>)([^>]*)(?=<\/). This will match all the words, spaces and special characters inside the text.

Raphael Setin Over a year ago

@Vijayakumar I am not an expert with RegEx, that's why my proposed solution wasn't that fancy haha. If your solution works, that's even better. But the caveat in my opinion is that I don't know what special characters exactly yours covers. You should mention them at least.

Flame Over a year ago

Sadly this answer doesn't cover all cases. Simply add a Cell<b>bold</b> to any cell in the example and the regex match will be incomplete and faulty

norok2 · Accepted Answer · 2019-07-18 08:10:00Z

25

Use the pattern below to get the content of an element. Replace [tag] (including the square brackets) with the actual element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, like an anchor tag with href; in that case use the pattern below.

 <[tag][^>]*>(.+?)</[tag]>
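
One way to do the [tag] substitution programmatically is sketched below in Python. The helper name tag_pattern, the \b word boundary and the use of re.escape are additions of this sketch, not part of the answer.

```python
import re

def tag_pattern(tag):
    """Build the attribute-tolerant pattern for one tag name (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the tag name.
    name = re.escape(tag)
    # \b keeps <a ...> from also matching longer names such as <abbr>.
    return re.compile(rf"<{name}\b[^>]*>(.+?)</{name}>")

html = '<a href="/work">Work</a> and <a>About</a> but not <abbr>x</abbr>'
links = tag_pattern("a").findall(html)
print(links)  # ['Work', 'About']
```

Without the word boundary, [^>]* would happily consume the "bbr" of <abbr>, so the opening tag of a different element would be accepted.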

edited Jul 18, 2019 at 8:10

norok2

answered Nov 11, 2015 at 17:14

Shravan Ramamurthy

4 Comments

Alex Byrth Over a year ago

Try first example as '<head>(.+?)</head>' and works like expected. But I have no results with second one.

Martin Schneider Over a year ago

this doesn't work. <[tag]> will match <t>, <a> and <g>

LWC Over a year ago

@MA-Maddin - I think you missed the Replace [tag] with the actual element you wish to extract the content from part.

Martin Schneider Over a year ago

Oh well, yes. These [] should have been omitted altogether. That would be more clear, because of their meaning in RegEx and the fact, that people scan the code first and read the text after ;)

maqduni · Accepted Answer · 2018-08-30 09:19:02Z

14

This seems to be the simplest regular expression of all that I found

(?:<TAG>)([\s\S]*)(?:<\/TAG>)

Exclude the opening tag (?:<TAG>) from the captured group
Include any whitespace or non-whitespace characters ([\s\S]*) in the captured group
Exclude the closing tag (?:<\/TAG>) from the captured group
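
Note that [\s\S]* is greedy. A short Python sketch (language and sample string are assumptions) shows the difference it makes when the tag occurs more than once, and how adding ? changes it.

```python
import re

html = "<TAG>a</TAG> x <TAG>b</TAG>"

# Greedy: one match that runs from the first <TAG> to the last </TAG>.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", html)

# Lazy: one match per element.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", html)

print(greedy)  # ['a</TAG> x <TAG>b']
print(lazy)    # ['a', 'b']
```

The non-capturing groups only keep the tags out of group 1; the tags are still part of the full match.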

answered Aug 30, 2018 at 9:19

maqduni

4 Comments

Cody Over a year ago

Thank you. I burned through all of the above before this one worked for me. Needed one to scrape SCSS in an HTML file -- innerHTML of style[lang="scss"] -- and this did the trick. Here it is: regex101.com/r/VqhNsI/1.

kiwichris Over a year ago

only one that worked for me (javascript)

Amine KOUIS Over a year ago

it fails when the pre tag have the style attribute, check this regex demo.

maqduni Over a year ago

@AmineKOUIS try (?:<pre[^>]*>)([\s\S]*)(?:<\/pre>)

Community · Accepted Answer · 2017-05-23 12:26:29Z

9

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as nothing between the opening and closing tag is that same tag, this will work:

preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
// $matches = array([0] => full matched string, [1] => tag name, [2] => tag content)
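
The same pattern can be tried in Python, as sketched below (the sample string is an assumption). Two things differ from the PHP call: the / delimiters are PHP syntax and are dropped, and the pattern should be a raw string, because in a normal Python string "\1" is the control character chr(1) rather than a backreference.

```python
import re

# Raw string, no PHP-style / delimiters.
TAG_WITH_CONTENT = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

match = TAG_WITH_CONTENT.search('<p class="note">hello <b>world</b></p>')
tag_name, content = match.group(1), match.group(2)
print(tag_name)  # p
print(content)   # hello <b>world</b>

# Same pattern without the r prefix: "\1" is chr(1), so nothing matches.
broken = re.search("<(\\w+)[^>]*>(.*?)</\1>", "<p>hello</p>")
print(broken)    # None
```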

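A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and get its content, which might look something like this:

$doc = new DOMDocument();
$doc->loadHTML($html);                      // parse an HTML string (load() expects a file path)
$nodes = $doc->getElementsByTagName('el');  // DOMNodeList of all <el> elements
$value = $nodes->item(0)->nodeValue;        // text content of the first match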

And since this is a proper parser it will be able to handle nesting tags etc.
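
For comparison, a parser-based sketch using only Python's standard library (html.parser). The class name TagTextCollector and the helper text_between are hypothetical, and the choice to return one string per outermost element is a design assumption of this sketch.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost <tag>...</tag> element."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many <tag> elements are currently open
        self.buffer = []    # text pieces of the element being read
        self.results = []   # one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        # Only text that sits inside an open <tag> is kept.
        if self.depth > 0:
            self.buffer.append(data)

def text_between(html, tag):
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results

print(text_between("<pre>Foo<pre>Bar</pre>Zed</pre>dsada<pre>test</pre>", "pre"))
# ['FooBarZed', 'test']
```

Because the parser tracks depth, the nested input that trips up the regex versions is handled without special cases.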

edited May 23, 2017 at 12:26

CommunityBot

answered Aug 23, 2011 at 21:25

sg3s

5 Comments

sg3s Over a year ago

Just want to say I'm a slight bit disturbed that this is still gathering downvotes while it is the only answer which supplies a proper solution next to the regex one and I also added ample warning that it is probably not the right way... At least comment on what is so wrong about my answer, please.

trincot Over a year ago

The question was not tagged with php. Not sure how PHP came into the picture...

sg3s Over a year ago

@trincot This was more than 7 years ago, so I cannot remember. In any case it is an example of solving the problem with a regex and with a parser. The regex is good and php is just what I knew well at the time.

trincot Over a year ago

I understand, I saw your first comment and thought that this could explain some of the downvotes.

CS QGB Over a year ago

"/<([\w]+)[^>]*>(.*?)<\/\1>/" in python not match

Heriberto Rivera · Accepted Answer · 2015-10-23 18:31:58Z

5

Try this....

(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)

answered Oct 23, 2015 at 18:31

Heriberto Rivera

2 Comments

allicarn Over a year ago

Note that look behind is not supported in JavaScript.

Heriberto Rivera Over a year ago

Ooo of course, but this regex is for Java. thanks for your note.

aptyp · Accepted Answer · 2023-01-30 00:36:34Z

3

(?<=>)[^<]+

for Notepad++

>([^<]+)

for AutoIt (option Return array of global matches).

or

 (?=>([^<]+))

https://regex101.com/r/VtmEmY/1

edited Jan 30, 2023 at 0:36

answered May 6, 2021 at 18:17

aptyp

Shishir Arora · Accepted Answer · 2017-08-28 01:41:51Z

2

Since the accepted answer comes without JavaScript code, adding that here:

var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
// replace() is used only to iterate over the matches; g1 is the first capture group
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

edited Aug 28, 2017 at 1:41

answered Aug 28, 2017 at 1:12

Shishir Arora

Krishna thakor · Accepted Answer · 2018-10-16 10:48:40Z

2

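preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches); This regex will select everything between the tags, even when the content spans several lines, because the negated class [^>] also matches newlines. It stops matching if the content itself contains a > character.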

edited Oct 16, 2018 at 10:48

answered Oct 16, 2018 at 10:42

Krishna thakor

John · Accepted Answer · 2019-10-18 04:46:52Z

In Python, setting the DOTALL flag will capture everything, including newlines.

If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org

#example.py using Python 3.7.4  
import re

str="""Everything is awesome! <pre>Hello,
World!
    </pre>
"""

# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>",re.DOTALL)
matches = pattern.search(str)

print(matches.group(1))

python example.py

Hello,
World!

Capturing text between all opening and closing tags in a document

To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.

#example2.py using Python 3.7.4
import re

# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""

# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>",re.DOTALL)

matches = pattern.finditer(str)


for i,match in enumerate(matches):
    print(f"tag {i}: ",match.group(1))

python example2.py

tag 0:  Desert Survival Problem
tag 1:  humor conditions
tag 2:  endless opportunity for self-reflection

axellbrendow · Accepted Answer · 2022-11-12 22:40:05Z

More complex than PyKing's answer but matches any type of tag (except self-closing) and considers cases where the tag has HTML-like string attributes.

/<TAG_NAME(?:STRING|NOT_CLOSING_TAG_NOT_QUOTE)+>INNER_HTML<\/\1 *>/g

Raw: /<([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>(.*?)<\/\1 *>/g

group #1 = tag name

group #2 = string attr

group #3 = inner html

JavaScript code testing it:

let TAG_NAME = '([^\\s</>]+)'; // doubled backslash: in a string literal '\s' would collapse to a plain 's'
let NOT_CLOSING_TAG_NOT_QUOTE = '[^>"]';
let STRING = '("(?:[^"\\\\]|\\\\.)*")';

let NON_SELF_CLOSING_HTML_TAG =
                                                              // \1 is a back reference to TAG_NAME
    `<${TAG_NAME}(?:${STRING}|${NOT_CLOSING_TAG_NOT_QUOTE})+>(.*?)</\\1 *>`;

let tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG, 'g');

let myStr = `Aenean <abc href="/life<><>\\"<?/abc></abc>"><a>life</a></abc> sed consectetur.
<a href="/work">Work Inner HTML</a> quis risus eget <a href="/about">about inner html</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>`;

let matches = myStr.match(tagRegex);

// Removing 'g' flag to match each tag part in the for loop
tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG);

for (let i = 0; i < matches.length; i++) {
  let tagParts = matches[i].match(tagRegex);
  console.log(`Tag #${i} = [${tagParts[0]}]`);
  console.log(`Tag #${i} name: [${tagParts[1]}]`);
  console.log(`Tag #${i} string attr: [${tagParts[2]}]`);
  console.log(`Tag #${i} inner html: [${tagParts[3]}]`);
  console.log('');
}

Output:

Tag #0 = [<abc href="/life<><>\"<?/abc></abc>"><a>life</a></abc>]
Tag #0 name: [abc]
Tag #0 string attr: ["/life<><>\"<?/abc></abc>"]
Tag #0 inner html: [<a>life</a>]

Tag #1 = [<a href="/work">Work Inner HTML</a>]
Tag #1 name: [a]
Tag #1 string attr: ["/work"]
Tag #1 inner html: [Work Inner HTML]

Tag #2 = [<a href="/about">about inner html</a>]
Tag #2 name: [a]
Tag #2 string attr: ["/about"]
Tag #2 inner html: [about inner html]

Tag #3 = [<ve text="<></ve>>">abc</ve>]
Tag #3 name: [ve]
Tag #3 string attr: ["<></ve>>"]
Tag #3 inner html: [abc]

This doesn't work if:

The tag has any descendant tag of the same type
The tag starts on one line and ends on another. (In my case I remove line breaks from the HTML.)

If you change (.*?)<\/\1 *> to ([\s\S]*?)<\/\1 *> it should match the tag's inner html even if everything is not in the same line. For some reason it didn't work for me on Chrome and Node but worked here with the JavaScript's Regex Engine:

https://www.regextester.com

Regex: <([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>([\s\S]*?)<\/\1 *>

Test String:

Aenean lacinia <abc href="/life<><><?/a></a>">  
<a>life</a></abc> sed consectetur.
<a href="/work">Work</a> quis risus eget urna mollis ornare <a href="/about">about</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>

Dharman · Accepted Answer · 2021-03-30 17:37:40Z

1

To select all the text inside a pre tag, I prefer

preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);

$matches[0] will have results including <pre> tag

$matches[1] will have all the content inside <pre>.

DOMDocument's nodeValue and textContent do not help when the requirement is to get the content together with the tags nested inside the searched tag: they strip all tags and return only the text, without tags and attributes.

edited Mar 30, 2021 at 17:37

Dharman♦

answered Mar 30, 2021 at 17:32

nirvana74v

Hamzat Oluwabori · Accepted Answer · 2022-07-28 21:56:48Z

1

 test.match(/<pre>(.*?)<\/pre>/g)?.map((a) => a.replace(/<pre>|<\/pre>/g, ""))

This should be a preferred solution, especially if you have multiple pre tags in the context.

answered Jul 28, 2022 at 21:56

Hamzat Oluwabori

James Gardiner · Accepted Answer · 2022-10-17 17:20:02Z

1

How about:

<PRE>(\X*?)<\/PRE>

answered Oct 17, 2022 at 17:20

James Gardiner

Ammy · Accepted Answer · 2017-02-17 15:10:32Z

0

You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );

answered Feb 17, 2017 at 15:10

Ammy

Sven Eberth · Accepted Answer · 2021-07-16 00:03:05Z

0

const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);

matches[3] is the content text. The pattern adapts to any tag name, with or without classes, but does not support nested structures.

edited Jul 16, 2021 at 0:03

Sven Eberth

answered Jul 15, 2021 at 23:29

coosigma

Dilip · Accepted Answer · 2016-11-16 22:18:11Z

-1

For multiple lines:

<htmltag>(.+)((\s)+(.+))+</htmltag>

edited Nov 16, 2016 at 22:18

answered Nov 16, 2016 at 19:10

Dilip

T.Todua · Accepted Answer · 2017-11-29 14:50:08Z

-1

I use this solution:

preg_match_all( '/<((?!<)(.|\n))*?\>/si',  $content, $new);
var_dump($new);

answered Nov 29, 2017 at 14:50

T.Todua

Jonathan · Accepted Answer · 2020-05-16 06:33:34Z

-1

In Javascript (among others), this is simple. It covers attributes and multiple lines:

/<pre[^>]*>([\s\S]*?)<\/pre>/

answered May 16, 2020 at 6:33

Jonathan

user5988518 · Accepted Answer · 2016-02-26 23:04:04Z

-4

<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>

answered Feb 26, 2016 at 23:04

user5988518

1 Comment

Andrew Regan Over a year ago

Please introduce / explain your answer using words.
