
I'm 100% new to regex, so I've been floundering about on regex101 trying to figure out how to get my desired output. I am using PostgreSQL to write a query that extracts a set of values from a string. Once extracted, I need to convert them to int types and then take the difference between the two values in each pair.

A sample of the data I am working with can be found here:
https://regex101.com/r/Twkphj/3 (Each line break is a new record/value of the data.) https://dbfiddle.uk/ELitHDni

create table t(id int generated always as identity primary key, data text);
insert into t(data)values
('01-08,24-32')
,('38-70')
,('01-25, 27-38')
,('1-6,13-20,25-32')
,('1-4, 7-8, 11-12')
,('1-83,85-112')
,(NULL)
,('NULL')
,('162-169')
,('145-167, 169-214, 217-218, 247-254, 256-257, 382')
,('01-17, 23-27')
,('73-120, 145-192, 217-264, 289-336, 361-408, 433-480, 505-552, 577-624, 649-696, 721-768')
,('1-33, 37-45');

The end goal is to get an output like this:

SELECT
  data,
  regex(difference of "data" points)
FROM table

data                                               difference (inclusive)
01-08,24-32                                        8,9
145-167, 169-214, 217-218, 247-254, 256-257, 382   23, 46, 2, 8, 2, 1

From here, I can just use split() if the stakeholder needs to break it down further.

Again, I'm not too familiar with regex, so regex101 has been great for breaking down and understanding why certain "tokens"(?) are used. I think I need to stick to PCRE2, if that matters.

TIA


3 Answers

  1. Convert your strings to arrays using string_to_array().
  2. Explode the arrays with unnest().
  3. Split each element at the - range separator using split_part().
  4. Filter out invalid numbers using pg_input_is_valid(). If you're on Postgres version 15 or lower, you can backport it.
  5. Cast what's left into ::integers and run your subtraction, adding 1 to make it inclusive.
  6. Re-aggregate using string_agg(), packing the results back into your CSV-like format. Adding with ordinality to unnest() gives you numbers indicating the original order of the elements, which you can use in the aggregate function's internal order by to make sure the order of the results matches that of the inputs.
select data, string_agg((1+b::int-a::int)::text, ',' order by ord)
from t
cross join lateral unnest(string_to_array(data, ',')) with ordinality as u(a_range, ord)
inner join lateral (select split_part(a_range, '-', 1) as a
                          ,split_part(a_range, '-', 2) as b) as elements
   on pg_input_is_valid(a, 'int')
  and pg_input_is_valid(b, 'int')
group by id;

demo at db<>fiddle

data string_agg
01-08,24-32 8,9
38-70 33
01-25, 27-38 25,12
1-6,13-20,25-32 6,8,8
1-4, 7-8, 11-12 4,2,2
1-83,85-112 83,28
162-169 8
145-167, 169-214, 217-218, 247-254, 256-257, 382 23,46,2,8,2
01-17, 23-27 17,5
73-120, 145-192, 217-264, 289-336, 361-408, 433-480, 505-552, 577-624, 649-696, 721-768 48,48,48,48,48,48,48,48,48,48
1-33, 37-45 33,9

In addition to everything above, you can consider handling some more cases:

  • greatest()-least() gives you absolute difference even if the larger number in a pair is listed first. Same can be done with 1+abs(b-a).
  • to support fractions, you'll probably need to figure out how to switch between adding 0.1, 1.0 and 1.1 to make
    • 0.2-0.4 a difference of 0.3
    • 0.2-1.4 a difference of 1.3
  • instead of filtering out non-numbers, you can nullify them with a case and turn into a zero using coalesce()
  • make scalars work like 1-element ranges, meaning that 5-5 is also accepted as 5 and results in a difference of 1. A num_nonnulls(a,b) should help with this point and the one above.
  • I try to avoid old-style comma joins when posting here, but otherwise I agree with @Lukasz Szozda using them to shorten things by reducing cross join (lateral) down to just ,.
select quote_literal(data)
      ,string_agg(coalesce(1+abs(a-b), num_nonnulls(a,b))::text,',' order by ord)
from t
,unnest(string_to_array(data,','))with ordinality as u(a_range,ord)
,lateral(select split_part(a_range,'-',1) as a_
               ,split_part(a_range,'-',2) as b_)as raw_elements
,lateral(select(case when pg_input_is_valid(a_,'numeric')then a_ end)::numeric as a
              ,(case when pg_input_is_valid(b_,'numeric')then b_ end)::numeric as b)_
group by id;
quote_literal string_agg
'01-08,24-32' 8,9
'NULL' 0
'5, 7' 1,1
'29-23' 7
',' 0,0
' , ' 0,0
'11, bleh, 13, 17-19' 1,0,1,3
'0xFff-0xAB, 0b0101-0b0111, .2-.5' 3925,3,1.3
'NaN, infinity, 5-infinity' 1,1,Infinity

If you need to produce outputs for null inputs, the commas/cross joins need to be swapped out for left join..on true. If you switch your result column type from a comma-separated string to a proper int[] array, those let you store an actual null at positions where the input pair is invalid or also null. They're also lighter, faster and easier to work with overall.
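A minimal sketch of that variant, reusing the names from the query above: the comma joins become left join .. on true, and array_agg() collects into an int[] so invalid pairs land as actual nulls.

```sql
-- Sketch: preserve rows whose data is null and collect into an int[].
-- Pairs that fail validation (and null inputs) show up as NULL elements.
select data, array_agg(1 + b::int - a::int order by ord) as diffs
from t
left join lateral unnest(string_to_array(data, ',')) with ordinality as u(a_range, ord)
  on true
left join lateral (select split_part(a_range, '-', 1) as a
                         ,split_part(a_range, '-', 2) as b) as elements
  on pg_input_is_valid(a, 'int')
 and pg_input_is_valid(b, 'int')
group by id;
```

Note that a row whose data is NULL now yields a single-element array holding a null rather than disappearing, which is the point of the left joins.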

I assumed you're using integers but if you consider making these numeric or float, you'd also be able to turn invalid inputs into NaN, in addition to allowing infinity and -infinity.

The range types mentioned by @charlieface might be a good idea. If you store those inputs, saving them as nummultirange makes them lighter, converts them to canonical form and unlocks native functions and operators, as well as indexed search and exclude or without overlaps constraints.
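To illustrate, a sketch of that conversion (not from the answer above; it assumes well-formed low-high pairs and Postgres 14+ for range_agg(), and bare scalars like 382 drop out the same way as in the first query):

```sql
-- Sketch: parse the pairs and store them as a nummultirange.
-- '[]' makes both bounds inclusive; range_agg() merges the ranges per row.
select data,
       range_agg(numrange(a::numeric, b::numeric, '[]')) as ranges
from t
cross join lateral unnest(string_to_array(data, ',')) as u(a_range)
cross join lateral (select split_part(a_range, '-', 1) as a
                          ,split_part(a_range, '-', 2) as b) as elements
where pg_input_is_valid(a, 'numeric')
  and pg_input_is_valid(b, 'numeric')
group by id;
```

A reversed pair like 29-23 would raise an error here, since range constructors reject lower > upper; greatest()/least() from the list above would guard against that.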


6 Comments

this is great. I managed to get it to work for my use case- I have a follow up question/query if you don't mind. I'm not sure if I should reopen the question and edit, just ask here, or make a new post entirely. Which one is best practice/preferred?
New one with a link back to this one, some context and a comment here linking the new one is always preferred. But it's genuinely the first time I see someone ask that and I usually answer regardless, so I'm happy to leave it up to you and whether you have time and patience for good practice.
I actually managed to figure out my own question so I'm happy I'm at least learning something. I was doing some validation on the outputs and did notice one thing I'm not able to resolve from your code. It looks like I'm not capturing the single page values. I didn't notice it when I parsed your explanation the first time, but in your example output '5, 7' should return 1,1 and '11, bleh, 13, 17-19' should return 1,0,1,3. Any changes we could make to the code so it doesn't exclude single page values?
They are currently zero because if this doesn't find the second operand, it ends up doing null subtraction, and then coalesce() swaps the resulting null for a zero. You can instead make sure it's only 0 or NaN for 2 null operands, 1 if 1 operand is not null, the inclusive diff if neither is null. You can use a coalesce(1+abs(b-a),num_nonnulls(a,b)). dbfiddle.uk/ylsuHWdj?hide=23

Using STRING_TO_ARRAY:

SELECT data,
  STRING_AGG(COALESCE((NULLIF(SPLIT_PART(c, '-', 2), '')::INT
                     - SPLIT_PART(c, '-', 1)::INT + 1)::TEXT, '1'),
  ',' ORDER BY ordinality) AS diff
FROM t
,UNNEST(string_to_array(REPLACE(t.data, ' ', ''), ',')) WITH ORDINALITY AS c
GROUP BY data;

Output:

+---------------------------------------------------+---------------+
|                       data                        |     diff      |
+---------------------------------------------------+---------------+
| 01-08,24-32                                       |           8,9 |
| 145-167, 169-214, 217-218, 247-254, 256-257, 382  | 23,46,2,8,2,1 |
+---------------------------------------------------+---------------+

db<>fiddle demo

1 Comment

Looks like we did the same thing, the same way, 2 minutes apart. I ended up overshooting a bit in fear you'll be back to finish the line of thought you had with nullif(). +1

The set-returning function regexp_matches will give you all matches in a string as a rowset (as long as you pass the 'g' flag).

So we can use that inside a lateral join to break out each range of the string separately. The regex (\d+)(?:-(\d+))? will match each range, conditionally matching the second part of it as well. \d is a digit, + means greedy match as many as possible, () is a capturing group (returned by the function) and (?:) is a non-capturing group (not returned by the function).

Then we can simply cast the resulting array values to int, do some arithmetic, and aggregate back up using array_agg. We do all of this inside a lateral join so we aggregate per outer row.

select *
from t
cross join lateral (
    select
      array_agg(
        coalesce(r.matches[2]::int, r.matches[1]::int)
        - r.matches[1]::int
        + 1
      ) as difference_incl
    from regexp_matches(t.data, '(\d+)(?:-(\d+))?', 'g') as r(matches)
) as r;

db<>fiddle

Consider converting your data column to the int4range[] or int4multirange type.
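A hedged sketch of that conversion, reusing the regex from this answer: the +1 on the upper bound is there because int4range is canonicalized to half-open [a,b) form, and it assumes each pair is listed low-high.

```sql
-- Sketch: turn each 'a-b' pair (or bare scalar) into an int4range,
-- then merge them into one int4multirange per row.
select t.id,
       range_agg(int4range(r.matches[1]::int,
                           coalesce(r.matches[2]::int, r.matches[1]::int) + 1)) as ranges
from t
cross join lateral regexp_matches(t.data, '(\d+)(?:-(\d+))?', 'g') as r(matches)
group by t.id;
```

range_agg() needs Postgres 14 or later; rows with null or non-matching data simply drop out of the cross join.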

2 Comments

+1 for the range types and collecting into a proper array, -1 for the regex-based numeric input validation. Repeating my responses to your now removed comment, regex would need to emulate PostgreSQL's own numeric input validation rules. The current pattern can't handle whitespace, fractions, underscore separators, exponential or non-base-10 notation, NaN or ±infinity. Still, if OP's input comes in guaranteed format, this method could be good enough, although a bit slow. Ultimately +1 from me (+1-1+1).
Correct, I've assumed each range is a range of positive integers separated by just a -. Whitespace between ranges does work in my example. If OP has cases that don't fit with this then they should show that, I can only deal with what I'm given.
