String parsing in Python, regular expression?

Question

I have text formatted as follows:

|relevant text| followed by |variable number of white spaces| followed by |relevant text (a folder path containing white spaces)| followed by |variable number of white spaces| followed by |not relevant text|

My goal is to retrieve the two relevant pieces of text, but I have no experience with regular expressions (I believe this is what I should use?).

Thanks in advance! :)

For example:

68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46

I would like to retrieve:

foo = 68465d1wd

bar = path.html

Can you post some example code? It seems like you can use split() to do that.
– angvillar, Commented Jun 19, 2013 at 14:33
What resource are you using for your regex information? The techniques you need are all pretty straightforward. Use parentheses for the relevant text and ` *` for variable white space. You just need to figure out how to separate your folder path from the non-relevant text.
– Peter Alfvin, Commented Jun 19, 2013 at 14:36
Are you sure these are multiple spaces and not tabs, by the way?
– alexis, Commented Jun 19, 2013 at 17:38

alexis · Accepted Answer · 2013-06-19 18:27:31Z

1

If your fields are separated by at least two spaces, this should do it:

import re
# Split on runs of two or more whitespace characters; expects exactly three fields.
foo, bar, _irrelevant = re.split(r"\s{2,}", line)

Edit: The above solution no longer works for the revised question. If (as I gather from your comments) the filename always has a .php or .htm[l] extension, and there's always a path before the final filename, you can try the following:

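# Split only once, at the first run of two or more whitespace characters.
foo, rest = re.split(r"\s{2,}", line, maxsplit=1)
# Match the run of non-backslash characters that ends in .php, .htm or .html.
bar = re.search(r"[^\\]*\.(?:php|html?)\b", rest).group(0)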

This will give you everything after the last backslash preceding .php, .htm or .html. Basically there's a regexp for everything, but you need to figure out what your data looks like.
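As a rough sketch of how the two steps fit together, here they are wrapped in a function and run on the sample line from the question. The name `parse_line` and the error handling are illustrative assumptions, and the approach still relies on the filename ending in .php, .htm or .html.

```python
import re

# Sample line from the question (raw string, so the backslashes stay literal).
line = r"68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46"


def parse_line(line):
    """Return (foo, bar) for one line; parse_line is a hypothetical helper name."""
    # Split off the first field at the first run of two or more whitespace characters.
    foo, rest = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    # Look for a run of non-backslash characters ending in .php, .htm or .html.
    match = re.search(r"[^\\]*\.(?:php|html?)\b", rest)
    if match is None:
        raise ValueError("no .php/.htm/.html filename found")
    return foo, match.group(0)


foo, bar = parse_line(line)
print(foo)  # 68465d1wd
print(bar)  # path.html
```

Because the second step only looks at what follows the last backslash before the extension, the runs of spaces inside the folder names do not matter here.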

edited Jun 19, 2013 at 18:27

answered Jun 19, 2013 at 14:42

alexis


5 Comments

Peter Alfvin Over a year ago

What if the filename contains two or more spaces in it?

zmo Over a year ago

here's a solution: foo, bar = (re.split(r"\s{2,}", line)[0], re.split(r"\s{2,}", line)[-2])

alexis Over a year ago

Two or more adjacent spaces in the filename? Then either the input is ambiguous, or there are other constraints we can leverage (e.g., the filename always has an extension, the irrelevant text never contains backslashes, etc.).

Peter Alfvin Over a year ago

@Alexis, yes, as far as I know, if one space is allowed in a filename, then two adjacent spaces are allowed, although obviously that would be rare. BTW, it wasn't me who downvoted your answer. :-)

alexis Over a year ago

They're allowed, sure, but how often does this happen? I think the OP needs to tell us what the data actually looks like.
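To make the ambiguity in this thread concrete, here is a small sketch that runs the plain split on the sample line from the question, where both the path and the trailing text contain runs of two or more spaces. It is only a demonstration of the failure mode, not a fix.

```python
import re

# Sample line from the question; the path and the trailing text both contain runs of spaces.
line = r"68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46"

fields = re.split(r"\s{2,}", line)
print(len(fields))  # more than 3, so unpacking into three names would raise ValueError
print(fields[0])    # the first field is still recovered intact
print(fields[1])    # only a fragment of the path, cut at the first inner run of spaces
```

Since the trailing text is also split into several pieces, indexing from the end (such as `[-2]`) does not reliably land on the path either.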

falsetru · Accepted Answer · 2013-06-19 14:49:26Z

1

>>> import re
>>> # Raw string, so that \n in C:\nice is not read as a newline.
>>> data = r'''68465d1wd        C:\nice\ program files\path.html          dw6d5w4dw5d4wd46'''
>>> re.split(r'\s{2,}', data)
['68465d1wd', 'C:\\nice\\ program files\\path.html', 'dw6d5w4dw5d4wd46']
>>> foo, bar = re.split(r'\s{2,}', data)[:2]
>>> foo
'68465d1wd'
>>> bar
'C:\\nice\\ program files\\path.html'
>>> import ntpath  # Windows path rules on any OS; os.path is ntpath on Windows
>>> ntpath.basename(bar)
'path.html'

Without regular expression:

>>> foo, rest = data.split(' ', 1)
>>> bar, rest = rest.lstrip().split('  ', 1)
>>> foo
'68465d1wd'
>>> bar
'C:\\nice\\ program files\\path.html'
>>> ntpath.basename(bar)
'path.html'
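For use outside an interactive session, the regex-based split could be wrapped along these lines. This is a minimal sketch: `first_two_fields` is a hypothetical name, the length check is an added assumption about how malformed lines should be handled, and `ntpath` is used so that backslash-separated paths are handled the same way on any operating system.

```python
import ntpath
import re


def first_two_fields(line):
    """Hypothetical helper: return (foo, filename) for one line."""
    # Fields are assumed to be separated by two or more whitespace characters.
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) < 2:
        raise ValueError("expected at least two fields")
    foo, path = fields[:2]
    # ntpath.basename keeps only what follows the last backslash (or slash).
    return foo, ntpath.basename(path)


# Raw string, so the backslashes in the path stay literal.
data = r"68465d1wd        C:\nice\ program files\path.html          dw6d5w4dw5d4wd46"
print(first_two_fields(data))  # ('68465d1wd', 'path.html')
```

Like the split itself, this only works while neither the path nor the first field contains two adjacent spaces.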

edited Jun 19, 2013 at 14:49

answered Jun 19, 2013 at 14:41

falsetru
