3

I am trying to do something that I thought would be simple (and probably is), but I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-###, but in some cases, where the single digit should be, there are several single digits separated by commas (e.g. ######-#,#,#-###). The number of comma-separated digits varies. Below is an example:

For the string below:

('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')

I need to return:

['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

I have only gotten as far as returning the strings that match the ######-#-### pattern:

import re
# Matches only the plain ######-#-### form (raw string keeps the backslashes literal)
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)

Thanks in advance for any help!

Matt

1
  • I don't know how the findall function would modify your code. Commented Dec 30, 2014 at 18:51

4 Answers

2

Perhaps something like this:

>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003

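Or using a list comprehension (call finditer again first, because the loop above has already exhausted the iterator):

>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]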
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

2 Comments

Awesome! That top example did the trick. I ran it on a test case of roughly 400 records and it did exactly what I needed. Thank you so much!
Impressive. Very nice.
0

Use '\d{6}-\d(,\d)*-\d{3}'.

* means "as many as you want (0 included)". It is applied to the previous element, here '(,\d)'.

4 Comments

Hi, thanks for the quick response. I tried that but it did not return what I was expecting. It returned the following: [',2', '', ',3', '']
Yes, when the pattern contains a capturing group, findall returns what was matched inside the parentheses rather than the whole match. If you add parentheses around the whole regexp, '(\d{6}-\d(,\d)*-\d{3})', then [x[0] for x in m] will give you what you want.
thanks serialk, looks like we are getting closer, but still not quite there. adding the parentheses results in : [('030421-1,2-001', ',2'), ('030421-1-002', ''), ('030421-1,2,3-002', ',3'), ('030421-1-003', '')]
Have you tried [x[0] for x in m] ? This gets you the first element of every tuple.
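
The findall behaviour discussed in these comments can be sketched as follows. This is only an illustration using the example string from the question; the variable names are made up here. Marking the inner group as non-capturing with (?:...) is one way to get the whole matches back without indexing into tuples.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# With a capturing group, findall returns only what the group captured:
# its last repetition, or '' when the group matched zero times.
captured = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)

# With a non-capturing group (?:...), findall returns each whole match.
whole = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)

print(captured)  # [',2', '', ',3', '']
print(whole)     # ['030421-1,2-001', '030421-1-002', '030421-1,2,3-002', '030421-1-003']
```

Note that this still returns the unexpanded forms such as '030421-1,2-001'; splitting the middle digits is a separate step.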
0

I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.

Looping over the list lets you write a single function to parse and fix each item; you can then append the result to a new list and display the joined string.

# myfunction is a placeholder for whatever parses and fixes one item
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item.strip())
    newList.append(newItem)

newString = ','.join(newList)

2 Comments

Not really the question, and the format is really easy to match with a regexp. I didn't downvote because your proposal makes sense but I don't think it's the best advice to follow. OP didn't specify a format, so maybe the general case is more complicated than you think.
Thanks for the response. Serialk is correct, the general case can be way more complicated. The only given is that the string will contain substrings in the format of ######-#-### or ######-#(,# any number of times)-###. There may be all other types of characters and text in the overall string (e.g. '0122.03, 0652.2 & 0652.5, ASSIGNMENT AND ASSUMPTION OF EASEMENT FOR POWER LINE PURPOSES, (EXPIRES WITH GROUND LEASE DATED 7/17/85 RECORDED AS INST NO. 167831 ON 7/30/85, SEE ALSO 030421-2-010) SEE 030421-2-020 & 030421-1-XXX').
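
For what it's worth, here is a rough sketch of how a regex could cope with the noisy string in the comment above. The pattern is adapted from the finditer answer in this thread; the names DOC_NUMBER and expand_doc_numbers are made up for this example, and it assumes the document numbers always have exactly six leading and three trailing digits.

```python
import re

# Pattern adapted from the finditer answer in this thread.
DOC_NUMBER = re.compile(r'\b(\d{6}-)(\d(?:,\d)*)(-\d{3})\b')

def expand_doc_numbers(text):
    """Return every document number in text, one entry per middle digit."""
    return [prefix + digit + suffix
            for prefix, middle, suffix in DOC_NUMBER.findall(text)
            for digit in middle.split(',')]

noisy = ('0122.03, 0652.2 & 0652.5, ASSIGNMENT AND ASSUMPTION OF EASEMENT '
         'FOR POWER LINE PURPOSES, (EXPIRES WITH GROUND LEASE DATED 7/17/85 '
         'RECORDED AS INST NO. 167831 ON 7/30/85, SEE ALSO 030421-2-010) '
         'SEE 030421-2-020 & 030421-1-XXX')

print(expand_doc_numbers(noisy))  # ['030421-2-010', '030421-2-020']
```

Entries such as 030421-1-XXX are skipped because the pattern requires three trailing digits; whether that is the desired behaviour depends on the data.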
0

(\d{6}-)((?:\d,?)+)(-\d{3})

We take 3 capturing groups. We match the first part and the last part the easy way. The center part is a digit, optionally followed by a ',', repeated one or more times. A repeated capturing group only keeps its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run. What we're left with is the following result:

>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]

You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
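
That last step might look something like the sketch below. It reuses the pattern from this answer and the example string from the question; the variable names are only illustrative. The "if digit" filter is an assumption added here to drop the empty piece that a trailing comma in the middle part would otherwise produce.

```python
import re

pattern = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
text = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Split the middle term on commas and rebuild one number per digit.
expanded = [prefix + digit + suffix
            for prefix, middle, suffix in pattern.findall(text)
            for digit in middle.split(',')
            if digit]  # skip empty pieces, e.g. from a trailing comma

print(expanded)
```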
