3

I need to parse lines that contain multiple language codes, like this:

008800002     Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>
  • 008800002 is the ID
  • Bruxelles-Nord$Br�ussel Nord$ is name one
  • deu is language one
  • $Brussel Noord$ is name two
  • nld is language two

So the idea is that a name/language pair can appear any number of times, and I need to collect them all. The language code inside <> is always exactly 3 characters long, and every name ends with a $ sign.

I tried this, but it does not give the expected output:

x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)(?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)

I have no idea how to capture the repeated elements. It takes

Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.

You might want to fix encoding issues first. It's Brüssel, not Br�ussel. Commented Oct 1, 2014 at 9:48

2 Answers

3

Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.

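import re

line = u"008800002     Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>"
# Step 1: split the numeric ID from the rest of the line.
m = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
stop_id = m.group(1)
# Step 2: each non-greedy match is one name/language pair; key the dict by language code.
names = {}
for pair in re.finditer(r"(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[pair.group(2)] = pair.group(1)
print(stop_id, names)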
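
One thing to watch with a dict keyed by language: if the same language code shows up twice on a line, the later name overwrites the earlier one. If that can happen in your data, a list of pairs keeps everything in order. Here is a minimal sketch of the same two-step idea; the names `LINE_RE`, `PAIR_RE` and `parse_pairs` are made up for illustration, and it assumes a 9-digit ID and exactly 3 characters inside `<>`, as described in the question.

```python
import re

# Assumed layout: a 9-digit ID, whitespace, then the name/language section.
LINE_RE = re.compile(r"(\d{9})\s+(.*)")
# One pair: everything up to the next '<', then a 3-character language code.
PAIR_RE = re.compile(r"([^<]*)<([^>]{3})>")

def parse_pairs(line):
    """Return (stop_id, [(name, lang), ...]), or None if the line has no ID."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # findall returns one (name, lang) tuple per pair, in order of appearance.
    return m.group(1), PAIR_RE.findall(m.group(2))
```

The names still carry their `$` delimiters here; strip them afterwards if you do not need them.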

2
\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>

Try this. Just grab the captures. See the demo:

http://regex101.com/r/hS3dT7/4
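
For reference, this is roughly how that pattern could be used from Python. It is only a sketch: `parse_line` is a hypothetical helper name, and stripping the `$` delimiters from each name is an assumption about what you want to keep. Each match fills either group 1 (the ID) or groups 2 and 3 (a name and its language), so the loop checks which alternative matched.

```python
import re

# The pattern from above: either the numeric ID, or a name followed by <lang>.
PATTERN = re.compile(r"\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>")

def parse_line(line):
    """Collect the ID and a {lang: name} dict from one input line."""
    stop_id = None
    names = {}
    for m in PATTERN.finditer(line):
        if m.group(1) is not None:
            # First alternative matched: this is the ID.
            stop_id = m.group(1)
        else:
            # Second alternative matched: group 2 is the name, group 3 the language.
            # Assumption: leading/trailing '$' are delimiters, so strip them.
            names[m.group(3)] = m.group(2).strip("$")
    return stop_id, names
```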
