
I've got a file formatted like this:

3 name1
2    name2
1    name3

The space between the number and the name can be one or several spaces, or any number of tabs.

I'm trying to find a way to match this line with a regex and extract the number and the name in a list or tuple.

I could write this in several lines, but I'd rather have one clean line that can both recognize tabs and whitespace and give me my values. I've been unsuccessful in doing that.

edit: I've tried using re.search('^[\d]+[\s|\t]+.*', line) to match any number of digits, either spaces or tabs and then anything. But this doesn't work - presumably because I'm not telling it what to extract for me.

  • Can you share what you've tried thus far? Commented Feb 9, 2015 at 23:57
  • @ZacharyCross, sure - added above Commented Feb 10, 2015 at 0:02
  • tabs are whitespace. \s|\t is redundant. Also I don't think you know what [ ] does. Commented Feb 10, 2015 at 0:04
  • @Falmarri Actually, it's bugged, rather than just redundant. It allows the pipe character | to be matched: bool(re.search('[\s|\t]+', ' | ')) (that's a bunch of spaces with a | in the middle) gives True. Commented Feb 10, 2015 at 0:46
  • @jpmc26: Interesting. Does the | character always mean a literal in a character class? Or is this a bug in Python's regex engine? Commented Feb 16, 2015 at 22:33
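
On the question raised in the last comment: in Python's re syntax, | is treated as a literal character inside [...], so the behaviour jpmc26 describes is expected rather than an engine bug. A small check, using a made-up sample string with a pipe where the whitespace should be:

```python
import re

# Inside [...], "|" is an ordinary character, so this class matches
# whitespace OR a literal pipe.
buggy = re.compile(r'^\d+[\s|\t]+.*')
# \s already covers tabs, so this is all the original pattern needed.
fixed = re.compile(r'^\d+\s+.*')

sample = '3|name1'  # hypothetical line, not from the question's file

buggy_matches = buggy.match(sample) is not None
fixed_matches = fixed.match(sample) is not None
print(buggy_matches, fixed_matches)
```

The first pattern accepts the malformed line and the second rejects it, which is why the character class in the question is wrong rather than merely redundant.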

2 Answers


All you need to do is add parens around what you want to capture:

>>> import re
>>> line = '1\t abc'
>>> re.search(r'^(\d+)\s+(.*)', line).groups()
('1', 'abc')

Incidentally, the regex you used starts with ^, which (without the re.MULTILINE flag) matches only at the beginning of the string. re.match only ever matches at the beginning of the string, so it can be used in place of search here and the ^ dropped:

>>> re.match(r'(\d+)\s+(.*)', line).groups()
('1', 'abc')
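
One caveat with the one-liners above: search and match return None when a line does not fit the pattern, so calling .groups() directly raises AttributeError on such lines. A minimal sketch of a more defensive version follows; the names LINE_RE and parse_line and the sample lines are made up for illustration, and converting the number to int is an assumption about what you want.

```python
import re

# One or more digits, any run of whitespace (spaces or tabs), then the rest.
LINE_RE = re.compile(r'(\d+)\s+(.*)')

def parse_line(line):
    """Return (number, name) for a matching line, or None otherwise."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    number, name = m.groups()
    return int(number), name.rstrip()

sample_lines = ['3 name1', '2    name2', '1\t\tname3', 'no number here']
parsed = [parse_line(s) for s in sample_lines]
print(parsed)
```

Compiling the pattern once is optional here, but it keeps the regex in one place if the loop runs over many lines.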


You don't need a regex at all. You can use str.split, and it does not matter whether there are 1 or 21 spaces (or tabs) between the fields:

lines="""3 name1
2    name2
1    name3"""

for line in lines.splitlines():
    num, name = line.split()
    print(num,name)
3 name1
2 name2
1 name3

In a list comp:

print([line.split() for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]

Replace lines.splitlines() with your file object in your own code.
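
As a rough sketch of what that looks like with a file object: io.StringIO stands in for a real file below, and skipping lines that do not have exactly two fields and converting the number to int are assumptions about the desired behaviour.

```python
import io

# io.StringIO stands in for a real file here; in real code this would be
# a file object opened on your own path.
fake_file = io.StringIO("3 name1\n2    name2\n1\t\tname3\n\n")

pairs = []
for line in fake_file:
    fields = line.split()
    if len(fields) != 2:  # skip blank or malformed lines
        continue
    num, name = fields
    pairs.append((int(num), name))

print(pairs)
```

Iterating over a file object yields lines with their trailing newline, which split() with no arguments discards along with any other surrounding whitespace.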

Using a regex just to split on whitespace is also considerably slower:

In [13]: timeit re.search(r'^(\d+)\s+(.*)', line).groups()
1000000 loops, best of 3: 2.04 µs per loop

In [14]: timeit line.split()
1000000 loops, best of 3: 222 ns per loop

In [15]: re.search(r'^(\d+)\s+(.*)', line).groups()
Out[15]: ('1', 'abc')

In [16]: line.split()
Out[16]: ['1', 'abc']

For this input, split returns the same two values (as a list rather than a tuple) in just over a tenth of the time.
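
The timings above come from an IPython session. If you want to repeat the comparison in a plain script, something like the following should work with the standard timeit module. Note that it precompiles the pattern, which the session above did not, and the absolute numbers will depend on your machine and Python version.

```python
import re
import timeit

line = '1\t abc'
pattern = re.compile(r'^(\d+)\s+(.*)')

# Check that both approaches give the same values before timing them.
regex_result = pattern.search(line).groups()
split_result = line.split()

# Time 100,000 calls of each approach.
regex_time = timeit.timeit(lambda: pattern.search(line).groups(), number=100_000)
split_time = timeit.timeit(lambda: line.split(), number=100_000)

print(regex_result, split_result)
print(f'regex: {regex_time:.4f}s  split: {split_time:.4f}s')
```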

Even if there are more than two values you can split and extract the first two:

lines="""3 name1 foo
2    name2  bar
1    name3 foobar """

print([line.split(None, 2)[:2] for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]
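
One difference from the regex is worth keeping in mind: (.*) captures everything after the first run of whitespace, while the split above keeps only the next whitespace-delimited word. If a name may itself contain spaces (an assumption, since the question's sample data has none), limiting split to a single split gives the regex-like result. The sample line here is hypothetical:

```python
line = '2    name two  \n'  # hypothetical line whose name contains a space

# maxsplit=1 splits only at the first run of whitespace, so the rest of
# the line (including inner spaces) stays together.
num, name = line.split(None, 1)
name = name.rstrip()  # the remainder keeps its trailing whitespace

# For comparison, the approach above keeps only the first word of the name.
first_two = line.split(None, 2)[:2]

print((num, name), first_two)
```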
