
Hi, I have this string in Python:

'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. @ Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'

I need to extract the following:

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard

I tried to do this by using:

val = desc.split("\r\n")

and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.

Any help will be highly appreciated.

  • I think you've got it right, actually. I assume this is part of a more general problem? I.e., are there always going to be 5 vendors? Could there be additional lines before the third, so that the time would end up at some other index, val[?]? Otherwise, you've got it right. Commented Feb 11, 2014 at 0:43

2 Answers


If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL (set inline with (?sx)):

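import re

# (?s) lets "." match newlines; (?x) ignores whitespace inside the pattern.
desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)

if desc_match:
    for gname in ['loc', 'time', 'vends']:
        # print() with a single argument behaves the same on Python 2 and 3
        print(desc_match.group(gname))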

Given your definition of desc, this prints out:

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard

Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck). And again, this is only "nicer" if it works more often than your solution using str.split(), that is, if there are any possible input strings for which your solution does not produce the correct result.
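To make that last point concrete, here is a small sketch of an input where fixed indices go wrong but the pattern still matches. The `shifted` string is a made-up variation on the question's text with one extra intro paragraph; it is an assumption about how the format might vary, not real data.

```python
import re

# Hypothetical variant of the question's string: one extra intro paragraph.
shifted = (
    "Every Wednesday and Friday, this market is perfect for lunch!\r\n\r\n"
    "Check out live music every Friday.\r\n\r\n"
    "Location: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\n\r\n\r\n"
    "CATERING NEEDS?"
)

# Index-based approach from the question: val[2] is no longer the location.
val = shifted.split("\r\n")
print(repr(val[2]))

# Same pattern as in the answer: it anchors on the labels, not on positions.
desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', shifted)

if desc_match:
    # strip() drops any stray "\r" left over from the "\r\n" line endings
    print(repr(desc_match.group('loc').strip()))
```

With this input the split-based lookup returns the extra paragraph, while the regex still finds the "Location:" line.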

If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...)):

r'''(?sx)
    Location: \s* (?P<loc>.+?)   [\n\r]
    Time:     \s* (?P<time>.+?)  [\n\r]
    Vendors:  \s* (?P<vends>.+?) (?:\n\r?){2}'''
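As a rough sketch of how a values-only pattern might feed a model, the snippet below collects the three values into a dict and splits the vendor block into a list. The `desc` literal is an abbreviated copy of the question's string; the `\r?\n` line-ending handling and the names `PATTERN` and `fields` are my own assumptions rather than part of the pattern above.

```python
import re

# Abbreviated copy of the question's string (intro and footer shortened).
desc = (
    "Every Wednesday and Friday, this market is perfect for lunch!\r\n\r\n"
    "Location: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n"
    "The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n"
    "CATERING NEEDS? Have OtG cater your next event!"
)

# Assumption: lines end in "\r\n" or "\n", and a blank line ends the vendors.
PATTERN = re.compile(r'''(?sx)
    Location: \s* (?P<loc>.+?)   \r?\n
    Time:     \s* (?P<time>.+?)  \r?\n
    Vendors:  \s* (?P<vends>.+?) (?:\r?\n){2}''')

match = PATTERN.search(desc)
if match:
    fields = {
        # strip() removes the line ending the lazy group may pick up
        "location": match.group("loc").strip(),
        "time": match.group("time").strip(),
        # one vendor per line; splitlines() treats "\r\n" as a single break
        "vendors": match.group("vends").splitlines(),
    }
    print(fields)
```

Compiling the pattern once is optional here, since `re.search` caches compiled patterns anyway; it mainly keeps the pattern in one named place if this runs inside a loop.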

2 Comments

Thanks, this works better (i.e. it is more general). Do you know how I can use regex further to extract each of the values? I want to store the time, location and vendors in my model. I can do a split on ":" but then the time case won't work. And I want to make it part of a nested loop. Thanks!
Edited my answer. What is causing difficulty with nesting this inside a loop? It's not inefficient to repeatedly call re.search - Python keeps a cache of regular expressions so that it does not have to repeatedly compile the same one.
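
# s is the description string from the question (called desc there)
NLNL = "\r\n\r\n"

# Blank lines separate the paragraphs: parts[0] is the intro,
# parts[1] is the location/time block, parts[2] is the vendor block.
parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)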

which gives

Location: 5th St. @ Minna St.
Time: 11:00am-2:00pm

Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
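
If the individual fields are needed rather than one block of text, the same split can be taken a step further. This is only a sketch, assuming the layout is always intro, then the location/time block, then the vendor block; `s` is an abbreviated copy of the question's string, and the names `info_block`, `vendor_block`, `info` and `vendors` are mine.

```python
NLNL = "\r\n\r\n"

# Abbreviated copy of the question's string (intro and footer shortened).
s = (
    "Every Wednesday and Friday, this market is perfect for lunch!\r\n\r\n"
    "Location: 5th St. @ Minna St.\r\n"
    "Time: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n"
    "The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n"
    "CATERING NEEDS? Have OtG cater your next event!"
)

parts = s.split(NLNL)
info_block, vendor_block = parts[1], parts[2]

# Split each "Label: value" line on the first ": " only, so the colons
# inside "11:00am-2:00pm" are left alone.
info = dict(line.split(": ", 1) for line in info_block.splitlines())

# The first line of the vendor block is the "Vendors:" header; skip it.
vendors = vendor_block.splitlines()[1:]

print(info["Location"])
print(info["Time"])
print(vendors)
```

Limiting the split to the first ": " is what avoids the problem with the time value that plain splitting on ":" runs into.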
