10

I do not understand why my columns reg1 and reg2 remove "bbb" from my string, while only reg3 works as expected.

WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)

SELECT
  teststring,
  regexp_replace(teststring, '<.+>') AS reg1,
  regexp_replace(teststring, '<.*>') AS reg2,
  regexp_replace(teststring, '<.*?>') AS reg3
FROM t


TESTSTRING             REG1        REG2          REG3
aaa <b>bbb</b> ccc     aaa ccc     aaa ccc       aaa bbb ccc

Thanks a lot!

4 Answers

19

Because regex quantifiers are greedy by default, i.e. the expressions .* and .+ try to take as many characters as possible. Therefore <.+> will span from the first < to the last >. Make the quantifier lazy by appending the lazy operator ?:

regexp_replace(teststring, '<.+?>')

or

regexp_replace(teststring, '<.*?>')

Now, the search for > will stop at the first > encountered.

Note that . matches > as well; therefore the greedy variant (without ?) swallows every > except the last.
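
To see exactly what each pattern matches, regexp_substr can return the matched text instead of removing it. This is only an illustrative sketch using the test string from the question; the column aliases greedy_match and lazy_match are made up for this example.

```sql
-- Show the first substring each pattern matches in the question's test string.
WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)
SELECT
  regexp_substr(teststring, '<.+>')  AS greedy_match,  -- <b>bbb</b>
  regexp_substr(teststring, '<.+?>') AS lazy_match     -- <b>
FROM t;
```

The greedy pattern matches the whole span <b>bbb</b>, which is why bbb disappears in reg1 and reg2. The lazy pattern stops at the first >, so it matches <b> and then </b> separately.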


2 Comments

Thanks a lot! I thought ? only stood for "one or zero".
? stands for "one or zero" unless it occurs after a quantifier, where it acts as the lazy operator. See: Lazy quantification.
1

You can remove HTML tags from a string with:

REGEXP_REPLACE (teststring,'<[^>]*>',' ')
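
As a rough sketch of how this behaves on the question's test string (the column aliases are hypothetical): the negated character class [^>] can never run past a >, so no lazy operator is needed. Note that replacing each tag with ' ' instead of nothing leaves extra spaces behind.

```sql
-- Compare removing tags outright with replacing each tag by a space.
WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)
SELECT
  regexp_replace(teststring, '<[^>]*>')      AS tags_removed,  -- 'aaa bbb ccc'
  regexp_replace(teststring, '<[^>]*>', ' ') AS tags_spaced    -- 'aaa  bbb  ccc' (double spaces)
FROM t;
```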

1 Comment

Although your answer is correct in the sense of producing the result the OP expected, the OP was asking why some of his code worked and some didn't. He already had working code and was looking for an explanation. That explanation was provided 7 years ago; your answer doesn't address the question and is posted in an old thread.
0

Because the first and the second patterns both find this match: <b>bbb</b>. In this case b>bbb</b matches both .* and .+

The third one also won't do what you need. You are looking for something like this: <[^>]*>. But you also need to replace all matches with "".

1 Comment

The third one does exactly what I need. I just didn't understand why. @Olivier gave the useful answer.
-1

If you are merely trying to display the string without all the HTML tags, you can use the function: utl_i18n.unescape_reference(column_name)

1 Comment

A late response, but unescape_reference only converts character references such as &lt; back into characters. It doesn't strip out any actual HTML tags (e.g. <p>).
