
I'm trying to parse the following two strings in Python:

Here's the first string

s1="< one > < two > < three > here's one attribute < six : 10.3 > < seven : 8.5 > < eight : 90.1 > < nine : 8.7 >"

I need a regular expression that splits the string above into a list, where each line below shows the element stored at that index:

0 one    
1 two 
2 three
3 here's one attribute
4 six : 10.3
5 seven : 8.5
6 eight : 90.1
7 nine : 8.7

Here's the second string

s2="<one><two><three> an.attribute ::"

Similarly, I need the items stored in a list like this:

0 one
1 two
2 three
3 an.attribute

Here's what I've tried so far. The regular expression comes from an answer to another question I posted on Stack Overflow.

import re
from pprint import pprint

# Non-greedy match of everything between "< " and " >"
res = re.findall(r'< (.*?) >', s1)
pprint(res)
for item in res:
    print(item)

But that skips "here's one attribute".

Output:

['one', 'two', 'three', 'six : 10.3', 'seven : 8.5', 'eight : 90.1', 'nine : 8.7']
one
two
three
six : 10.3
seven : 8.5
eight : 90.1
nine : 8.7

Could anyone help me out? =)

If anyone also knows how to extract the numerical values (10.3, 8.5, 90.1 and 8.7) from the first string, that would be great too =)

EDIT: Duncan, I tried your code but I'm not getting the output I expected. I assume I've made an error somewhere. Could you tell me what it is?

from __future__ import generators
from pprint import pprint
s2="<one><two><three> an.attribute ::"
s1="< one > < two > < three > here's one attribute < six : 10.3 > < seven : 8.5 > <   eight : 90.1 > < nine : 8.7 >"
def parse(s):
    for t in s.split('<'):
        for u in t.strip().split('>',1):
            if u.strip(): yield u.strip()

list(parse(s1))
list(parse(s2))
pprint(s1)
pprint(s2)

Here's the output I'm getting:

"< one > < two > < three > here's one attribute < six : 10.3 > < seven : 8.5 > < eight : 90.1 > < nine : 8.7 >"
'<one><two><three> an.attribute ::'
  • docs.python.org/2/library/re.html Commented Mar 13, 2013 at 10:37
  • @daveoncode I've updated my question, could you have another look? :) Commented Mar 13, 2013 at 10:43
  • @vartec thanks for the link, I've had a look at that but haven't figured it out yet Commented Mar 13, 2013 at 10:44

2 Answers

This captures all the pieces; you can add a few if statements and tweaks to get the exact output you want.

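import re

c = 0
# Raw string; "> *" makes the space after ">" optional so the final tag,
# which ends the string, is matched as well
m = re.compile(r'< (\w+) (: ([\d.]+))* *> *([^<]*)')
for r in m.finditer(s1):
    c = c + 1
    (tag, junk, number, attribute) = r.groups()
    print("%d %s" % (c, attribute))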

EDIT: more explanation

The re.compile line prepares a regexp for use. To break down what this regexp does, first you have to understand that the round brackets ( ) mark the items that will end up in the results (r.groups()).

So the expression < (\w+) means: find a <, then a space, then start a capture group. The capture group contains one or more "word characters" (letters, digits and the underscore).

This is how the tag is found.

The next bit is (: ([\d.]+))*. Again a capture group is started, then a : must be present, then another capture group begins; groups are allowed to be nested inside each other.

The square brackets [] define a character class and \d matches a digit. The . in this context is just a literal dot! So the class matches anything that is a digit or a dot. The + means "one or more of the previous expression", so it's one or more digits or dots. This is what gets the number.

Finally, after the round brackets close the capture groups, there is an asterisk *. It means "match zero or more repetitions of the previous expression", which has the effect of making the previous group optional (a ? would say "optional" more precisely). Not all the tags have numbers.
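To see the optional group in action, here is a small sketch that applies only the tag part of the pattern (the trailing attribute group is left out for brevity) to one tag with a number and one without. The variable names are just for illustration.

```python
import re

# Tag part of the pattern only, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r'< (\w+) (: ([\d.]+))* *>')

with_number = pattern.match('< six : 10.3 >')
without_number = pattern.match('< one >')

# groups() returns (tag, whole ": number" group, number)
print(with_number.groups())
# Groups that took no part in the match come back as None
print(without_number.groups())
```

The None values are how you can tell, after the match, that a tag had no number attached.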

I'll stop my explanation of the regexp there. There are many great resources for learning how to construct regexps.

The finditer call simply applies the regexp repeatedly along the string and yields each match it finds.

The expression (tag,junk,number,attribute)=r.groups() unpacks the tuple returned by r.groups() into the individual variables tag, junk, number and attribute.
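Putting it together, a rough sketch of how the matches could be turned into the list from the question, plus the numbers as floats. It assumes the space after > is made optional ("> *") so that the last tag, which ends the string, also matches. The names items and numbers are my own.

```python
import re

s1 = ("< one > < two > < three > here's one attribute "
      "< six : 10.3 > < seven : 8.5 > < eight : 90.1 > < nine : 8.7 >")

# Assumption: "> *" (optional space) so the final tag also matches.
pattern = re.compile(r'< (\w+) (: ([\d.]+))* *> *([^<]*)')

items = []    # tags and free text, in order of appearance
numbers = []  # numeric values converted to float

for r in pattern.finditer(s1):
    tag, junk, number, attribute = r.groups()
    if number is None:
        items.append(tag)
    else:
        items.append('%s : %s' % (tag, number))
        numbers.append(float(number))
    # Free text following the tag, if there is any
    if attribute.strip():
        items.append(attribute.strip())

print(items)
print(numbers)
```

This pattern still expects a space after each <, so the second string (<one><two>...) would need a further tweak.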


Here's a quick solution that doesn't use regular expressions at all:

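def parse(s):
    # Each chunk after a '<' holds a tag and, possibly, trailing free text
    for t in s.split('<'):
        # Split once on '>' to separate the tag from whatever follows it
        for u in t.strip().split('>', 1):
            if u.strip():
                yield u.strip()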

>>> list(parse(s1))
['one', 'two', 'three', "here's one attribute", 'six : 10.3', 'seven : 8.5', 'eight : 90.1', 'nine : 8.7']
>>> list(parse("<one><two><three> an.attribute ::"))
['one', 'two', 'three', 'an.attribute ::']

>>> from pprint import pprint
>>> pprint(list(parse(s1)))
['one',
 'two',
 'three',
 "here's one attribute",
 'six : 10.3',
 'seven : 8.5',
 'eight : 90.1',
 'nine : 8.7']

You could even write it as a single list comprehension, but I wouldn't recommend it:

>>> [ u.strip() for t in s1.split('<') for u in t.strip().split('>',1) if u.strip() ]
['one', 'two', 'three', "here's one attribute", 'six : 10.3', 'seven : 8.5', 'eight : 90.1', 'nine : 8.7']

2 Comments

Could you take a look at the edit I've added to my question? :)
Sure. You called list(parse(s1)) and then just printed s1. Try pprint(list(parse(s1))) to see the parsed output.
