0

I have text that is formatted as follows

|relevant text| followed by |variable number of white spaces| followed by |relevant text (a folder path containing white spaces)| followed by |variable number of white spaces| followed by |not relevant text|

My goal is to retrieve the two relevant pieces of text, but I have no experience with regular expressions (I believe this is what I should use?)

Thanks in advance! :)

For example:

68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46

I would want to retrieve

foo = 68465d1wd

bar = path.html

12
  • 1
    can you put some example code? seems like you can use split() to do that Commented Jun 19, 2013 at 14:33
  • 1
    What resource are you using for your regex information? The techniques you need are all pretty straightforward. Use parentheses for the relevant text and ` *` for variable white space. You just need to figure out how to separate your folder path from the non-relevant text. Commented Jun 19, 2013 at 14:36
  • @Twissell edited in an example Commented Jun 19, 2013 at 14:36
  • @PeterAlfvin I have no experience in using regex. Commented Jun 19, 2013 at 14:37
  • 1
    Are you sure these are multiple spaces and not tabs, by the way? Commented Jun 19, 2013 at 17:38
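
To illustrate the suggestion above about parentheses for the relevant text and a space quantifier for the variable whitespace, here is a minimal sketch of a single pattern with two capture groups. It rests on assumptions that are not stated in the question: the first field contains no whitespace, the irrelevant text contains no backslashes, and the file name ends in .php, .htm or .html. The group names `foo` and `bar` are taken from the question, and `PATTERN` is a name made up for this example.

```python
import re

# Assumptions (not confirmed by the question): the first field has no whitespace,
# the irrelevant text has no backslashes, and the file name ends in a known extension.
PATTERN = re.compile(
    r"^(?P<foo>\S+)"                      # first relevant field
    r" +"                                 # variable run of spaces
    r".*\\"                               # path up to the last usable backslash
    r"(?P<bar>[^\\]*?\.(?:php|html?))\b"  # file name with a known extension
)

line = r"68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46"
m = PATTERN.match(line)
if m:
    print(m.group("foo"), m.group("bar"))
```

If the data turns out to use tabs rather than spaces, replacing ` +` with `\s+` would be the obvious adjustment.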

2 Answers

1

If your fields are separated by at least two spaces, this should do it:

import re
foo, bar, _irrelevant = re.split(r"\s{2,}",  line)

Edit: The above solution no longer works for the revised question. If (as I gather from your comments) the filename always has a .php or .htm[l] extension, and there's always a path before the final filename, you can try your luck with the following:

foo, rest = re.split(r"\s{2,}", line, maxsplit=1)
bar = re.search(r"[^\\]*\.(?:php|html?)\b", rest).group(0)

This will give you everything after the last backslash preceding .php, .htm or .html. Basically there's a regexp for everything, but you need to figure out what your data looks like.
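
As a rough end-to-end sketch of the two steps above, run against the example line from the question: the helper name `extract_fields` is hypothetical, and the code assumes the first field contains no whitespace and the file name ends in .php, .htm or .html. It returns `None` instead of raising when the line does not fit that shape.

```python
import re

def extract_fields(line):
    """Return (foo, bar) for a line shaped like the question's example, else None."""
    # Split off the first field at the first run of two or more whitespace characters.
    parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    foo, rest = parts
    # Find a backslash-free run that ends in one of the assumed extensions.
    match = re.search(r"[^\\]*\.(?:php|html?)\b", rest)
    if match is None:
        return None
    return foo, match.group(0)

# Raw string so the backslashes in the path are kept literally.
line = r"68465d1wd        C:\nice\ pro   g  ram   files\path.html          d   d5 d   w4d   w5 d   4wd46"
print(extract_fields(line))  # ('68465d1wd', 'path.html')
```

Note that the match is greedy within the text after the last backslash, so this would over-match if the irrelevant text itself contained something like ".html".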


5 Comments

What if the filename contains two or more spaces in it?
here's a solution: foo, bar = (re.split(r"\s{2,}", line)[0], re.split(r"\s{2,}", line)[-2])
Two or more adjacent spaces in the filename? Then either the input is ambiguous, or there are other constraints we can leverage (e.g., if the filename always has an extension, the irrelevant text never contains backslashes, etc.)
@Alexis, yes, as far as I know, if one space is allowed in a filename, then two adjacent spaces are allowed, although obviously that would be rare. BTW, it wasn't me who downvoted your answer. :-)
They're allowed, sure, but how often does this happen? I think the OP needs to tell us what the data actually looks like.
1
>>> import re
>>> # Raw string: otherwise the \n in C:\nice is read as a newline
>>> data = r'''68465d1wd        C:\nice\ program files\path.html          dw6d5w4dw5d4wd46'''
>>> re.split(r'\s{2,}', data)
['68465d1wd', 'C:\\nice\\ program files\\path.html', 'dw6d5w4dw5d4wd46']
>>> foo, bar = re.split(r'\s{2,}', data)[:2]
>>> foo
'68465d1wd'
>>> bar
'C:\\nice\\ program files\\path.html'
>>> import ntpath  # Windows path rules on any OS; os.path splits on backslashes only on Windows
>>> ntpath.basename(bar)
'path.html'

Without regular expressions:

>>> foo, rest = data.split(' ', 1)
>>> bar, rest = rest.lstrip().split('  ', 1)
>>> foo
'68465d1wd'
>>> bar
'C:\\nice\\ program files\\path.html'
>>> ntpath.basename(bar)
'path.html'
