2

I have a table that has a column that holds an html string that may or may not be well formed (tried to use xmltable route and it didn't work) which is why I am trying to use the following sql syntax. I am trying to create a query that uses REGEXP_SUBSTR, LEVEL and CONNECT BY LEVEL based on examples I found online but I am not able to get the results I would expect:

I created an example query below:

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')

By running this query, I would expect 7 rows to come back:

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112

Instead its coming back with 22 rows.

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114

I don't know where the problem is that I am getting duplicates. On the actual production table there are a lot of rows so the query just spins and never returns. I suspect its because a lot of duplicates are being generated. Can anyone tell me how to fix the above query so that I can try to transport it onto my actual query?

Thanks,

3
  • What is your expected output from the sample data used in query? Commented Aug 4, 2022 at 18:47
  • I have updated my original post with the output. Commented Aug 4, 2022 at 18:56
  • that's not going to work. The connect by/regexp_count trick is for a single string, not multiple rows of strings. The CONNECT BY is working on the results of both rows. You have to process each row separately or create a pipelined function to do the processing Commented Aug 4, 2022 at 19:06

2 Answers 2

4

You need to include the ID in the connect-by clause; but for this to wotk you also need to include a non-deterministic function call; e.g.:

AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL

So

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL
ID CONTENTS DATA_JUMP
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Client Servic 111
2 Customer Servic 112
2 Tecnical Servic 113
2 Other Servic 114

db<>fiddle

Or you could use recursive subquery factoring instead of a hierarchical query, which is maybe easier to understand:

WITH rcte (id, html_string, lvl, contents, data_jump) AS (
  SELECT ID, html_string, 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
  FROM qry
  UNION ALL
  SELECT ID, html_string, lvl + 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
  FROM rcte
  WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
  OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl

db<>fiddle

Or even, preserving the last character of the 'contents' by removing the stray . from the regex as @Gary_W pointed out:

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
),
rcte (id, html_string, lvl, contents, data_jump) AS (
  SELECT ID, html_string, 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
  FROM qry
  UNION ALL
  SELECT ID, html_string, lvl + 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
  FROM rcte
  WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
  OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl
ID CONTENTS DATA_JUMP
1 Client Service 111
1 Customer Service 112
1 Tecnical Service 113
2 Client Service 111
2 Customer Service 112
2 Tecnical Service 113
2 Other Service 114

db<>fiddle

Sign up to request clarification or add additional context in comments.

4 Comments

That is close, I'd be expecting 3 rows for ID 1 and 4 rows for ID2.
@CreationSL - sorry, wasn't paying attention... missed out a NOT. I've added a recursive CTE option too.
Your examples are amazing. I found I have a new issue with my data when using CONNECT BY LEVEL version that I'm wondering if it can be fixed. In my original example, I thought I would always have a "data-jump" element in the <a> tag. It appears the production data doesn't always have that element. So if I have 3 records and the first record is missing the data-jump element, the data-jump values from records 2 and 3 are appearing as they are data-jumps values for records 1 and 2. Is there a way to fix that?
@CreationSL - you could change the regex pattern, but you might be chasing edge cases forever. That's one of the problems with trying to parse XML with regular expressions; but your example can't be treated as XML.
0

Just cross join with a series of integers. Also more portable than CONNECT BY ...

-- your input, don't use in final query ...
WITH qry AS (
SELECT 1 AS id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' AS html_string FROM dual
UNION
SELECT 2 AS id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' AS html_string FROM dual
)
-- end of input; real query starts here, replace following comma with "WITH"
,
-- need a bunch of consecutive integers ...
i(i) AS (
            SELECT 1 FROM dual
  UNION ALL SELECT 2 FROM dual
  UNION ALL SELECT 3 FROM dual
  UNION ALL SELECT 4 FROM dual
)
SELECT
  id
, REGEXP_SUBSTR(html_string,'(?<=data-jump=")\d+',1,i) AS data_jump
, REGEXP_SUBSTR(html_string,'(?<=\"\>)[^<]+',1,i)      AS content
FROM qry CROSS JOIN i
WHERE REGEXP_SUBSTR(html_string,'(?<=data-jump=")\d+',1,i) <> ''
ORDER BY id,i;
-- out  id | data_jump |     content      
-- out ----+-----------+------------------
-- out   1 | 111       | Client Service
-- out   1 | 112       | Customer Service
-- out   1 | 113       | Tecnical Service
-- out   2 | 111       | Client Service
-- out   2 | 112       | Customer Service
-- out   2 | 113       | Tecnical Service
-- out   2 | 114       | Other Service

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.