parsing string - regex help in python

Question

Hi, I have this string in Python:

'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. @ Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'

I need to extract the following:

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard

I tried to do this by using:

val = desc.split("\r\n")

and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.

Any help will be highly appreciated.

I think you've got it right, actually. I assume this is part of a more general problem? I.e., are there always going to be 5 vendors? Could there be additional lines before the third, so that the time would be val[?]? Otherwise, you've got it right.
– audiodude, Commented Feb 11, 2014 at 0:43

GVH · Accepted Answer · 2014-02-13 07:24:21Z

1

If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL (set inline via (?sx)):

import re

desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)

if desc_match:
    for gname in ['loc', 'time', 'vends']:
        print(desc_match.group(gname))

Given your definition of desc, this prints out:

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard

Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck.) And again, this is only "nicer" if it works more often than your solution using str.split() - that is, if there are any possible input strings for which your solution does not produce the correct result.

If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...)):

r'''(?sx)
    Location: \s* (?P<loc>.+?)   [\n\r]
    Time:     \s* (?P<time>.+?)  [\n\r]
    Vendors:  \s* (?P<vends>.+?) (?:\n\r?){2}'''
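
To store the values in a model, the captured groups probably still need a little cleanup: with "\r\n" line endings, each group keeps a trailing "\r", and the vendors come back as one multi-line string. Below is a minimal sketch of that cleanup, not a definitive implementation. The names FIELDS and parse_desc and the dictionary keys are assumptions made for illustration, and desc is a shortened copy of the string from the question.

```python
import re

# Shortened copy of the description string from the question.
desc = (
    "Every Wednesday and Friday, this market is perfect for lunch!\r\n\r\n"
    "Location: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n"
    "The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n"
    "CATERING NEEDS? Have OtG cater your next event!"
)

# Values-only pattern from the answer, compiled once for reuse.
FIELDS = re.compile(r'''(?sx)
    Location: \s* (?P<loc>.+?)   [\n\r]
    Time:     \s* (?P<time>.+?)  [\n\r]
    Vendors:  \s* (?P<vends>.+?) (?:\n\r?){2}''')


def parse_desc(text):
    """Return location, time and a list of vendors, or None if nothing matches."""
    m = FIELDS.search(text)
    if m is None:
        return None
    return {
        # strip() drops the trailing "\r" left over from "\r\n" line endings
        "location": m.group("loc").strip(),
        "time": m.group("time").strip(),
        # splitlines() handles "\r\n", "\n" and a lone "\r"
        "vendors": [v.strip() for v in m.group("vends").splitlines() if v.strip()],
    }


print(parse_desc(desc))
```

Returning None for a non-matching description lets the caller decide whether to skip the record or report it.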

edited Feb 13, 2014 at 7:24

answered Feb 11, 2014 at 1:07

GVH


2 Comments

user2216194 Over a year ago

Thanks, this works better (i.e. it is more general). Do you know how I can use regex further to extract each of the values? I want to store the time, location and vendors in my model. I can do a split on ":" but then the time case won't work. And I want to make it part of a nested loop. Thanks!

GVH Over a year ago

Edited my answer. What is causing difficulty with nesting this inside a loop? It's not inefficient to repeatedly call re.search - Python keeps a cache of regular expressions so that it does not have to repeatedly compile the same one.
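
A rough sketch of using the pattern inside a loop, assuming the descriptions arrive as a list of strings. The list contents and the names FIELDS, descriptions and records are made up for illustration. Calling re.compile explicitly is optional because of the cache, but it makes the reuse visible.

```python
import re

# Values-only pattern from the answer, compiled once outside the loop.
FIELDS = re.compile(r'''(?sx)
    Location: \s* (?P<loc>.+?)   [\n\r]
    Time:     \s* (?P<time>.+?)  [\n\r]
    Vendors:  \s* (?P<vends>.+?) (?:\n\r?){2}''')

# Hypothetical input: one well-formed description and one without fields.
descriptions = [
    "Lunch market.\r\n\r\nLocation: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nHiyaaa\r\n\r\n\r\nCATERING NEEDS?",
    "A description with no structured fields.",
]

records = []
for text in descriptions:
    m = FIELDS.search(text)
    if m is None:
        continue  # skip descriptions that do not follow the expected layout
    # groupdict() maps group names to captured text; strip() removes stray "\r"
    records.append({name: value.strip() for name, value in m.groupdict().items()})

print(records)
```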

Hugh Bothwell · Accepted Answer · 2014-02-11 00:45:05Z

1

# s is the description string from the question (called desc there)
NLNL = "\r\n\r\n"

parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)

which gives

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
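
If the individual values are needed rather than the joined text, the same split can be taken one step further without a regex. This is only a sketch under the assumption that the blank-line layout never changes. The names header, vendor_block, fields and vendors are illustrative, and s is a shortened copy of the string from the question.

```python
NLNL = "\r\n\r\n"

# Shortened copy of the description string from the question.
s = (
    "Every Wednesday and Friday, this market is perfect for lunch!\r\n\r\n"
    "Location: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n"
    "The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n"
    "CATERING NEEDS? Have OtG cater your next event!"
)

parts = s.split(NLNL)
header, vendor_block = parts[1], parts[2]

# Split each header line on the first ": " only, so the colons inside
# "11:00am-2:00pm" are left alone.
fields = dict(line.split(": ", 1) for line in header.splitlines())

# Drop the "Vendors:" heading line and keep the names.
vendors = vendor_block.splitlines()[1:]

print(fields["Location"])
print(fields["Time"])
print(vendors)
```

Limiting the split to the first ": " is what keeps the time value in one piece.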

answered Feb 11, 2014 at 0:45

Hugh Bothwell
