
I'm 100% new to regex, so I've been floundering about on regex101 trying to figure out how to get my desired output. I am using PostgreSQL to write a query that extracts a set of values from a string. Once extracted, I need to convert them to int types and then take the difference between the two values in each pair.

A sample of the data I am working with can be found here:
https://regex101.com/r/Twkphj/3 (Each line break is a new record/value of the data.) https://dbfiddle.uk/ELitHDni

create table t(id int generated always as identity primary key, data text);
insert into t(data)values
('01-08,24-32')
,('38-70')
,('01-25, 27-38')
,('1-6,13-20,25-32')
,('1-4, 7-8, 11-12')
,('1-83,85-112')
,(NULL)
,('NULL')
,('162-169')
,('145-167, 169-214, 217-218, 247-254, 256-257, 382')
,('01-17, 23-27')
,('73-120, 145-192, 217-264, 289-336, 361-408, 433-480, 505-552, 577-624, 649-696, 721-768')
,('1-33, 37-45');

The end goal is to get an output like this:

SELECT
  data,
  regex(difference of "data" points)
FROM table

data                                               difference (inclusive)
01-08,24-32                                        8,9
145-167, 169-214, 217-218, 247-254, 256-257, 382   23, 46, 2, 8, 2, 1

From here, I can just use split() if the stakeholder needs to break it down further.

Again, I'm not too familiar with regex, so regex101 has been great for breaking down and understanding why certain "tokens"(?) are used. I think I need to stick to PCRE2, if that matters.

TIA


3 Answers

  1. Convert your strings to arrays using string_to_array().
  2. Explode the arrays with unnest().
  3. Split each element at the - range separator using split_part().
  4. Filter out invalid numbers using pg_input_is_valid(). If you're on Postgres version 15 or lower, you can backport it.
  5. Cast what's left into ::integers and run your subtraction, adding 1 to make it inclusive.
  6. Re-aggregate using string_agg(), packing the results back into your CSV-like format. Adding with ordinality to unnest() gives you numbers indicating the original order of the elements, which you can use in the aggregate function's internal order by to make sure the order of the results matches that of the inputs.
select data, string_agg((1+b::int-a::int)::text, ',' order by ord)
from t
cross join lateral unnest(string_to_array(data, ',')) with ordinality as u(a_range, ord)
inner join lateral (select split_part(a_range, '-', 1) as a
                          ,split_part(a_range, '-', 2) as b) as elements
   on pg_input_is_valid(a, 'int')
  and pg_input_is_valid(b, 'int')
group by id;

demo at db<>fiddle

data string_agg
01-08,24-32 8,9
38-70 33
01-25, 27-38 25,12
1-6,13-20,25-32 6,8,8
1-4, 7-8, 11-12 4,2,2
1-83,85-112 83,28
162-169 8
145-167, 169-214, 217-218, 247-254, 256-257, 382 23,46,2,8,2
01-17, 23-27 17,5
73-120, 145-192, 217-264, 289-336, 361-408, 433-480, 505-552, 577-624, 649-696, 721-768 48,48,48,48,48,48,48,48,48,48
1-33, 37-45 33,9

In addition to everything above, you can consider handling some more cases:

  • greatest()-least() gives you absolute difference even if the larger number in a pair is listed first. Same can be done with 1+abs(b-a).
  • to support fractions, you'll probably need to figure out how to switch between adding 0.1, 1.0 and 1.1 to make
    • 0.2-0.4 a difference of 0.3
    • 0.2-1.4 a difference of 1.3
  • instead of filtering out non-numbers, you can nullify them with a case and turn into a zero using coalesce()
  • make scalars work like 1-element ranges, meaning that 5-5 is also accepted as 5 and results in a difference of 1. A num_nonnulls(a,b) should help with this point and the one above.
  • I try to avoid old-style comma joins when posting here, but otherwise I agree with @Lukasz Szozda using them to shorten things by reducing cross join (lateral) down to just ,.
select quote_literal(data)
      ,string_agg(coalesce(1+abs(a-b), num_nonnulls(a,b))::text,',' order by ord)
from t
,unnest(string_to_array(data,','))with ordinality as u(a_range,ord)
,lateral(select split_part(a_range,'-',1) as a_
               ,split_part(a_range,'-',2) as b_)as raw_elements
,lateral(select(case when pg_input_is_valid(a_,'numeric')then a_ end)::numeric as a
              ,(case when pg_input_is_valid(b_,'numeric')then b_ end)::numeric as b)_
group by id;
quote_literal string_agg
'01-08,24-32' 8,9
'NULL' 0
'5, 7' 1,1
'29-23' 7
',' 0,0
' , ' 0,0
'11, bleh, 13, 17-19' 1,0,1,3
'0xFff-0xAB, 0b0101-0b0111, .2-.5' 3925,3,1.3
'NaN, infinity, 5-infinity' 1,1,Infinity

If you need to produce outputs for null inputs, the commas/cross joins need to be swapped out for left join..on true. If you switch your result column type from a comma-separated string to a proper int[] array, those let you store an actual null at positions where the input pair is invalid or also null. They're also lighter, faster and easier to work with overall.
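A minimal sketch of that variant, reusing the names from the query above: the comma joins become left join .. on true, and array_agg() collects into an int[] so invalid pairs land as actual nulls.

```sql
-- Sketch: preserve rows whose data is null and collect into an int[].
-- Pairs that fail validation (and null inputs) show up as NULL elements.
select data, array_agg(1 + b::int - a::int order by ord) as diffs
from t
left join lateral unnest(string_to_array(data, ',')) with ordinality as u(a_range, ord)
  on true
left join lateral (select split_part(a_range, '-', 1) as a
                         ,split_part(a_range, '-', 2) as b) as elements
  on pg_input_is_valid(a, 'int')
 and pg_input_is_valid(b, 'int')
group by id;
```

Note that a row whose data is NULL now yields a single-element array holding a null rather than disappearing, which is the point of the left joins.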

I assumed you're using integers but if you consider making these numeric or float, you'd also be able to turn invalid inputs into NaN, in addition to allowing infinity and -infinity.

The range types mentioned by @charlieface might be a good idea. If you store those inputs, saving them as nummultirange makes them lighter, converts them to canonical form and unlocks native functions and operators, as well as indexed search and exclude or without overlaps constraints.
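To illustrate, a sketch of that conversion (not from the answer above; it assumes well-formed low-high pairs and Postgres 14+ for range_agg(), and bare scalars like 382 drop out the same way as in the first query):

```sql
-- Sketch: parse the pairs and store them as a nummultirange.
-- '[]' makes both bounds inclusive; range_agg() merges the ranges per row.
select data,
       range_agg(numrange(a::numeric, b::numeric, '[]')) as ranges
from t
cross join lateral unnest(string_to_array(data, ',')) as u(a_range)
cross join lateral (select split_part(a_range, '-', 1) as a
                          ,split_part(a_range, '-', 2) as b) as elements
where pg_input_is_valid(a, 'numeric')
  and pg_input_is_valid(b, 'numeric')
group by id;
```

A reversed pair like 29-23 would raise an error here, since range constructors reject lower > upper; greatest()/least() from the list above would guard against that.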


6 Comments

this is great. I managed to get it to work for my use case- I have a follow up question/query if you don't mind. I'm not sure if I should reopen the question and edit, just ask here, or make a new post entirely. Which one is best practice/preferred?
New one with a link back to this one, some context and a comment here linking the new one is always preferred. But it's genuinely the first time I see someone ask that and I usually answer regardless, so I'm happy to leave it up to you and whether you have time and patience for good practice.
I actually managed to figure out my own question so I'm happy I'm at least learning something. I was doing some validation on the outputs and did notice one thing I'm not able to resolve from your code. It looks like I'm not capturing the single page values. I didn't notice it when I parsed your explanation the first time, but in your example output '5, 7' should return 1,1 and '11, bleh, 13, 17-19' should return 1,0,1,3. Any changes we could make to the code so it doesn't exclude single page values?
They are currently zero because if this doesn't find the second operand, it ends up doing null subtraction, and then coalesce() swaps the resulting null for a zero. You can instead make sure it's only 0 or NaN for 2 null operands, 1 if 1 operand is not null, the inclusive diff if neither is null. You can use a coalesce(1+abs(b-a),num_nonnulls(a,b)). dbfiddle.uk/ylsuHWdj?hide=23

Using STRING_TO_ARRAY:

SELECT data,
  STRING_AGG(COALESCE((NULLIF(SPLIT_PART(c, '-', 2), '')::INT
                     - SPLIT_PART(c, '-', 1)::INT + 1)::TEXT, '1'),
  ',' ORDER BY ordinality) AS diff
FROM t
,UNNEST(string_to_array(REPLACE(t.data, ' ', ''), ',')) WITH ORDINALITY AS c
GROUP BY data;

Output:

+---------------------------------------------------+---------------+
|                       data                        |     diff      |
+---------------------------------------------------+---------------+
| 01-08,24-32                                       |           8,9 |
| 145-167, 169-214, 217-218, 247-254, 256-257, 382  | 23,46,2,8,2,1 |
+---------------------------------------------------+---------------+

db<>fiddle demo

1 Comment

Looks like we did the same thing, the same way, 2 minutes apart. I ended up overshooting a bit in fear you'll be back to finish the line of thought you had with nullif(). +1

The set-returning function regexp_matches will give you all matches in a string as a rowset (as long as you pass the 'g' flag).

So we can use that inside a lateral join to break out each range of the string separately. The regex (\d+)(?:-(\d+))? will match each range, conditionally matching the second part of it as well. \d is a digit, + means greedy match as many as possible, () is a capturing group (returned by the function) and (?:) is a non-capturing group (not returned by the function).

Then we can simply cast the resulting array values to int, do some arithmetic, and aggregate back up using array_agg. We do all of this inside a lateral join so we aggregate per outer row.

select *
from t
cross join lateral (
    select
      array_agg(
        coalesce(r.matches[2]::int, r.matches[1]::int)
        - r.matches[1]::int
        + 1
      ) as difference_incl
    from regexp_matches(t.data, '(\d+)(?:-(\d+))?', 'g') as r(matches)
) as r;

db<>fiddle

Consider converting your data column to the int4range[] or int4multirange type.
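A hedged sketch of that conversion, reusing the regex from this answer: the +1 on the upper bound is there because int4range is canonicalized to half-open [a,b) form, and it assumes each pair is listed low-high.

```sql
-- Sketch: turn each 'a-b' pair (or bare scalar) into an int4range,
-- then merge them into one int4multirange per row.
select t.id,
       range_agg(int4range(r.matches[1]::int,
                           coalesce(r.matches[2]::int, r.matches[1]::int) + 1)) as ranges
from t
cross join lateral regexp_matches(t.data, '(\d+)(?:-(\d+))?', 'g') as r(matches)
group by t.id;
```

range_agg() needs Postgres 14 or later; rows with null or non-matching data simply drop out of the cross join.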

2 Comments

+1 for the range types and collecting into a proper array, -1 for the regex-based numeric input validation. Repeating my responses to your now removed comment, regex would need to emulate PostgreSQL's own numeric input validation rules. The current pattern can't handle whitespace, fractions, underscore separators, exponential or non-base-10 notation, NaN or ±infinity. Still, if OP's input comes in guaranteed format, this method could be good enough, although a bit slow. Ultimately +1 from me (+1-1+1).
Correct, I've assumed each range is a range of positive integers separated by just a -. Whitespace between ranges does work in my example. If OP has cases that don't fit with this then they should show that, I can only deal with what I'm given.
