Python - Parse strings with variable repeating substring

Question

I am trying to do something which I thought would be simple (and probably is), but I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-###, but in some cases, where the single digit should be, there are multiple single digits separated by commas (e.g. ######-#,#,#-###). The number of comma-separated digits is variable. Below is an example:

For the string below:

('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')

I need to return:

['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

I have only gotten as far as returning the strings that match the ######-#-### pattern:

import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)

Thanks in advance for any help!

Matt

I don't know how the findall function would modify your code.

– Avinash Raj, Commented Dec 30, 2014 at 18:51

Ashwini Chaudhary · Accepted Answer · 2014-12-30 19:02:08Z

2

Perhaps something like this:

>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003

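Or using a list comprehension (the iterator has to be recreated, because the loop above exhausted it):

>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]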
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
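For reuse across many records, the same idea can be wrapped in a function. This is only a sketch in Python 3 syntax; the names `DOC_RE` and `expand_doc_numbers` are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer: prefix, comma-separated single digits, suffix.
DOC_RE = re.compile(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b')

def expand_doc_numbers(text):
    """Return every document number found in text, expanding '1,2,3' groups."""
    return [
        prefix + digit + suffix
        for prefix, digits, suffix in DOC_RE.findall(text)
        for digit in digits.split(',')
    ]

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
print(expand_doc_numbers(s))
```

Because `findall` builds a fresh list on every call, the function can be called repeatedly without the exhausted-iterator problem.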


2 Comments

Matthew Palavido Over a year ago

Awesome! That top example did the trick. I ran it on a test case of roughly 400 records and it did exactly what I needed. Thank you so much!

Igor Pejic Over a year ago

Impressive. Very nice.

Antoine Pietri · Answer · 2014-12-30 18:45:10Z

0

Use '\d{6}-\d(,\d)*-\d{3}'.

* means "as many as you want (0 included)". It is applied to the previous element, here '(,\d)'.
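A small check of what the quantifier accepts, using `re.fullmatch` (Python 3.4+) so that each candidate has to match in its entirety. The candidate strings are made up for illustration.

```python
import re

pattern = re.compile(r'\d{6}-\d(,\d)*-\d{3}')

candidates = [
    '030421-1-001',      # zero repetitions of (,\d)
    '030421-1,2-001',    # one repetition
    '030421-1,2,3-002',  # two repetitions
    '030421-12-001',     # two digits with no comma: not allowed
]

# Map each candidate to whether the whole string matches the pattern.
results = {c: bool(pattern.fullmatch(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```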

edited Dec 30, 2014 at 18:45

answered Dec 30, 2014 at 18:39


4 Comments

Matthew Palavido Over a year ago

Hi, thanks for the quick response. I tried that but it did not return what I was expecting. It returned the following: [',2', '', ',3', '']

Antoine Pietri Over a year ago

Yes, findall returns what was matched inside the parentheses (the capturing groups) rather than the whole match. If you add parentheses around the whole regexp, '(\d{6}-\d(,\d)*-\d{3})', then [x[0] for x in m] will give you what you want.

Matthew Palavido Over a year ago

Thanks serialk, looks like we are getting closer, but still not quite there. Adding the parentheses results in: [('030421-1,2-001', ',2'), ('030421-1-002', ''), ('030421-1,2,3-002', ',3'), ('030421-1-003', '')]

Antoine Pietri Over a year ago

Have you tried [x[0] for x in m]? This gets you the first element of every tuple.
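The exchange above comes down to how `re.findall` treats capturing groups. The sketch below reproduces the three cases in Python 3; the non-capturing `(?:...)` variant at the end is an alternative not mentioned in the comments.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# One capturing group: findall returns only that group
# (its last repetition, or '' when it matched zero times).
only_group = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)
print(only_group)

# Two capturing groups: findall returns tuples; element 0 is the outer group.
m = re.findall(r'(\d{6}-\d(,\d)*-\d{3})', s)
whole = [x[0] for x in m]
print(whole)

# No capturing groups at all: findall returns the whole matches directly.
non_capturing = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)
print(non_capturing)
```

Both of the last two forms return the full document-number strings, which still need to be expanded on the comma-separated digits.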

efreedom · Answer · 2014-12-30 18:51:10Z

0

I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.

Looping over the list lets you write a single function to parse and fix each item, push the result onto a new list, and then display the joined string.

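# myfunction is a placeholder for your own function that parses and fixes one item
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item.strip())
    newList.append(newItem)

newstring = ','.join(newList)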
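One thing worth checking before going this route: on the sample string from the question, a plain split on commas also cuts inside the comma-separated digit groups, so the per-item function would have to stitch fragments back together. A quick sketch (the variable name `parts` is only illustrative):

```python
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Replace '&' with ',' and split, as suggested, then trim whitespace.
parts = [p.strip() for p in s.replace('&', ',').split(',')]
print(parts)
# ['030421-1', '2-001', '030421-1-002', '030421-1', '2', '3-002', '030421-1-003']
```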


2 Comments

Antoine Pietri Over a year ago

Not really the question, and the format is really easy to match with a regexp. I didn't downvote because your proposal makes sense but I don't think it's the best advice to follow. OP didn't specify a format, so maybe the general case is more complicated than you think.

Matthew Palavido Over a year ago

Thanks for the response. Serialk is correct, the general case can be way more complicated. The only given is that the string will contain substrings in the format of ######-#-### or ###-#(,# any number of times)-###. There may be all other types of characters and text in the overall string (e.g. '0122.03, 0652.2 & 0652.5, ASSIGNMENT AND ASSUMPTION OF EASEMENT FOR POWER LINE PURPOSES, (EXPIRES WITH GROUND LEASE DATED 7/17/85 RECORDED AS INST NO. 167831 ON 7/30/85, SEE ALSO 030421-2-010) SEE 030421-2-020 & 030421-1-XXX').

Lanting · Answer · 2014-12-30 19:06:15Z

0

(\d{6}-)((?:\d,?)+)(-\d{3})

We take 3 capturing groups. We match the first part and last part the easy way. The center part is a digit optionally followed by a ',', repeated one or more times. A repeated capturing group only keeps its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run instead. What we're left with is the following result:

>>> import re
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]

You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
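A possible version of that list comprehension, sketched in Python 3. The names `head`, `mid`, `tail` and `expanded` are illustrative, and the `if d` guard is an addition to skip the empty piece that a trailing comma in the middle group would produce.

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')

# Split the middle group on commas and rebuild one number per digit.
expanded = [
    head + d + tail
    for head, mid, tail in m
    for d in mid.split(',')
    if d  # skip empty strings from a possible trailing comma
]
print(expanded)
```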
