Extracting two values from python regex

Question

I've got a file formatted like this:

3 name1
2    name2
1    name3

The space between the number and the name can be one or several spaces, or any number of tabs.

I'm trying to find a way to match this line with a regex and extract the number and the name in a list or tuple.

I could write this in several lines, but I'd rather have one clean line that can both recognize tabs and whitespace and give me my values. I've been unsuccessful in doing that.

edit: I've tried using re.search('^[\d]+[\s|\t]+.*', line) to match any number of digits, either spaces or tabs and then anything. But this doesn't work - presumably because I'm not telling it what to extract for me.

Tabs are whitespace, so \s|\t is redundant. Also, I don't think you know what [ ] does.
– Falmarri, Commented Feb 10, 2015 at 0:04
@Falmarri Actually, it's bugged, rather than just redundant. It allows the pipe character | to be matched: bool(re.search('[\s|\t]+', ' | ')) (that's a bunch of spaces with a | in the middle) gives True.
– jpmc26, Commented Feb 10, 2015 at 0:46
@jpmc26: Interesting. Is the | character always a literal inside a character class, or is this a bug in Python's regex engine?
– Falmarri, Commented Feb 16, 2015 at 22:33
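
To answer the follow-up: this is standard regex behavior rather than a bug. Inside a character class, | loses its alternation meaning and is matched literally, so [\s|\t] means "whitespace, a pipe, or a tab". A small check using only the standard re module (the variable names are just for illustration):

```python
import re

# Inside [...] the pipe is an ordinary character, not alternation.
buggy = re.compile(r'[\s|\t]+')
fixed = re.compile(r'\s+')  # \s already covers tabs

pipe_only = '|'
print(bool(buggy.fullmatch(pipe_only)))  # True: the class accepts a bare pipe
print(bool(fixed.fullmatch(pipe_only)))  # False: plain \s does not
print(bool(fixed.fullmatch(' \t ')))     # True: spaces and tabs both match \s
```

Testing against a string that contains only the pipe isolates the problem, since a string with surrounding spaces would match either pattern.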

John1024 · Accepted Answer · 2015-02-10 00:10:23Z

5

All you need to do is add parens around what you want to capture:

>>> import re
>>> line = '1\t abc'
>>> re.search(r'^(\d+)\s+(.*)', line).groups()  # raw string keeps \d and \s intact
('1', 'abc')

Incidentally, notice that the regex you used starts with ^, which (without re.MULTILINE) matches only at the beginning of the string. re.match always anchors at the start of the string, so it can be used in place of search here and the ^ can be dropped:

>>> re.match(r'(\d+)\s+(.*)', line).groups()
('1', 'abc')
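
Applied to a whole file, the same pattern can be compiled once and run per line. This is only a sketch: the helper name parse_line, the conversion to int, and the choice to skip non-matching lines are my assumptions, not part of the question. Note that re.match returns None when a line does not match, so calling .groups() directly on it would raise AttributeError.

```python
import re

# Number, one or more whitespace characters (spaces or tabs), then the name.
LINE_RE = re.compile(r'(\d+)\s+(.*)')

def parse_line(line):
    """Return (number, name) for a matching line, or None otherwise."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    number, name = m.groups()
    return int(number), name.rstrip()

sample = "3 name1\n2    name2\n1\t\tname3\n"
pairs = [p for p in map(parse_line, sample.splitlines()) if p is not None]
print(pairs)  # [(3, 'name1'), (2, 'name2'), (1, 'name3')]
```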

edited Feb 10, 2015 at 0:10

answered Feb 10, 2015 at 0:04

John1024


Padraic Cunningham · Accepted Answer · 2015-02-10 01:29:17Z

You don't need a regex at all. str.split with no arguments splits on any run of whitespace, so it does not matter whether there are 1 or 21 spaces (or tabs) between the fields:

lines="""3 name1
2    name2
1    name3"""

for line in lines.splitlines():
    num, name = line.split()
    print(num,name)
3 name1
2 name2
1 name3

In a list comp:

print([line.split() for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]

Replace lines.splitlines() with your file object in your own code.
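
As a sketch of that substitution: iterating over a file object yields one line at a time, and split() discards the trailing newline along with the other whitespace. Here io.StringIO stands in for a real file opened with open(). That stand-in, the hypothetical file name in the comment, and the check for malformed lines are assumptions for the example.

```python
import io

# Stand-in for: with open('yourfile.txt') as f:   (hypothetical file name)
f = io.StringIO("3 name1\n2    name2\n\n1\t\tname3\n")

pairs = []
for line in f:
    fields = line.split()
    if len(fields) != 2:   # skip blank or malformed lines
        continue
    num, name = fields
    pairs.append((int(num), name))

print(pairs)  # [(3, 'name1'), (2, 'name2'), (1, 'name3')]
```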

Using a regex to split on whitespace is not a very good approach:

In [13]: timeit re.search(r'^(\d+)\s+(.*)', line).groups()
1000000 loops, best of 3: 2.04 µs per loop

In [14]: timeit line.split()
1000000 loops, best of 3: 222 ns per loop

In [15]: re.search(r'^(\d+)\s+(.*)', line).groups()
Out[15]: ('1', 'abc')

In [16]: line.split()
Out[16]: ['1', 'abc']

For this input, split yields the same two values (as a list rather than a tuple) in just over a tenth of the time.
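
Outside IPython, the comparison can be reproduced with the standard timeit module. This is a rough sketch: the function names are made up for the example, the function-call overhead is included in both measurements, and absolute numbers depend on the machine and Python version, so no timings are shown.

```python
import re
import timeit

line = '1\t abc'

def with_regex():
    return re.search(r'^(\d+)\s+(.*)', line).groups()

def with_split():
    return line.split()

n = 100_000
t_regex = timeit.timeit(with_regex, number=n)
t_split = timeit.timeit(with_split, number=n)
print(f'regex: {t_regex / n * 1e9:.0f} ns per call')
print(f'split: {t_split / n * 1e9:.0f} ns per call')
```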

Even if there are more than two values you can split and extract the first two:

lines="""3 name1 foo
2    name2  bar
1    name3 foobar """


print([line.split(None, 2)[:2] for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]
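
One difference worth keeping in mind: the regex's (.*) captures the rest of the line, spaces included, while a bare split() breaks it into separate fields. If the name itself may contain whitespace, limiting the split to a single cut behaves more like the regex. A small sketch with made-up sample data:

```python
line = '3   first last'

print(line.split())         # ['3', 'first', 'last']
print(line.split(None, 1))  # ['3', 'first last']

# With maxsplit=1 there are always at most two fields to unpack.
num, name = line.split(None, 1)
```

Unlike the regex, split does not check that the first field is a number, so validate it separately if that matters.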
