How do I capture multiple patterns from a line in Python?

Question

Data looks like:

text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)
txt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)
text text txt txt c 222.30.233.121 -> 44.233.111.111
txt text txt text d 22.111.22.222 -> 153.33.233.111

I want to capture a, b, and c along with the two IPs on that line. I do not want the numbers in parentheses that are attached to some of the IPs.

I want my output to look something like this:

a 111.222.222.111 22.222.111.111
b 22.111.22.222 153.33.233.111
c 222.30.233.121 44.233.111.111

What the code looks like:

f=gzip.open(path+Fname,'rb')
for line in f:
    IP_info=re.findall( r'(a|b|c)\s+([0-9]+(?:\.[0-9]+){3})+[ -> ]+([0-9]+(?:\.[0-9]+){3})', line )
    print IP_info
f.close()

What my output actually looks like:

[('a', '111.222.222.111', '2.222.111.111')]
[('b', '22.111.22.222', '3.33.233.111')]

The two biggest problems I'm having:

1) The second IP in the output is not complete. The first two digits have been truncated.

2) I am not capturing information for "c".

This is what I use to test my regexes: regex101.com

– rbp, Commented Jan 14, 2016 at 16:01

Wiktor Stribiżew · Accepted Answer · 2016-01-14 16:09:03Z

2

Here is a regex that you can use:

\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})

See regex demo
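As a reading aid, here is a sketch of the same pattern written out with re.VERBOSE. The names IP and PATTERN are introduced here for illustration only and are not part of the original answer. Because verbose mode ignores unescaped whitespace, the literal spaces around the arrow are written as [ ]+.

```python
import re

# Four dot-separated runs of digits (hypothetical helper name).
IP = r"[0-9]+(?:\.[0-9]+){3}"

PATTERN = re.compile(
    rf"""
    \b([abcd])        # group 1: the single-letter label
    \s+
    ({IP})            # group 2: first IP
    (?:\(\d+\))?      # optional parenthesized number, not captured
    [ ]+->[ ]+        # literal arrow with spaces on both sides
    ({IP})            # group 3: second IP
    """,
    re.VERBOSE,
)

line = "text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)"
print(PATTERN.findall(line))
```

This is meant to behave the same as the one-line pattern; it only trades compactness for inline comments.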

There are several points of interest here:

I replaced your [ -> ]+ with " +-> +" (one or more spaces, then ->, then one or more spaces), since you meant to match that sequence of characters, not any mix of those single characters in any order. Note that inside the character class, the hyphen between the space and > created a range from space (0x20) to > (0x3E), which includes punctuation, parentheses, AND digits. That is why your IPs were partially "eaten".
Since there are optional numbers in parentheses after an IP, I added an optional non-capturing group (?:\(\d+\))? after the first IP. The parenthesized number after the second IP needs no handling, because the match simply stops at the end of that IP.
You did not match d in the first capturing group. I turned that group's alternation into a character class, [abcd], since I see only single letters; if these are placeholders for longer strings, revert to a group with alternatives: (a|b|c|d).
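The character class point can be checked directly. The following is a small sketch; the variable names broken, fixed, matched and sample are made up for this illustration. It lists every printable ASCII character that [ -> ] accepts and compares what each version of the separator matches on a sample string.

```python
import re

broken = re.compile(r"[ -> ]")   # parsed as the range " " through ">", plus " "
fixed = re.compile(r" +-> +")    # literal: spaces, "->", spaces

# Collect every printable ASCII character the broken class accepts.
matched = [chr(c) for c in range(0x20, 0x7F) if broken.fullmatch(chr(c))]
print("".join(matched))

sample = "111.222.222.111(123) -> 22.222.111.111"
# The broken class, repeated, can consume digits, dots and parentheses too.
print(repr(re.search(r"[ -> ]+", sample).group()))
# The fixed separator matches only the arrow and its surrounding spaces.
print(repr(fixed.search(sample).group()))
```

Since digits and dots fall inside the space-to-> range, the greedy [ -> ]+ runs through the second IP and only backtracks far enough to leave the shortest tail that still looks like an IP.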

See Python demo:

import re

# Compile once; the same pattern is reused for every line.
p = re.compile(r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})')
test_str = "text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)\ntxt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)\ntext text txt txt c 222.30.233.121 -> 44.233.111.111\ntxt text txt text d 22.111.22.222 -> 153.33.233.111"
for x in test_str.split("\n"):
    # findall returns a list of (letter, first IP, second IP) tuples.
    print(p.findall(x))

Output:

[('a', '111.222.222.111', '22.222.111.111')]
[('b', '22.111.22.222', '153.33.233.111')]
[('c', '222.30.233.121', '44.233.111.111')]
[('d', '22.111.22.222', '153.33.233.111')]
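To connect this back to the gzip loop in the question, here is a minimal sketch assuming Python 3 and a gzipped text file. The function name extract_ip_pairs and its gz_path parameter are hypothetical, not from the question or the answer. Opening with mode "rt" yields str lines rather than bytes, and the with-block takes care of closing the file.

```python
import gzip
import re

PATTERN = re.compile(
    r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})'
)

def extract_ip_pairs(gz_path):
    """Yield (label, first_ip, second_ip) for each matching line of a gzipped text file."""
    # "rt" = read in text mode, so each line is a str the pattern can search.
    with gzip.open(gz_path, "rt") as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                # groups() returns the three captured strings as a tuple.
                yield match.groups()
```

Looping with "for label, src, dst in extract_ip_pairs(path + Fname): print(label, src, dst)" would then print lines in the "a 111.222.222.111 22.222.111.111" form the question asked for. Using search instead of findall assumes at most one label and IP pair per line, as in the sample data.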

edited Jan 14, 2016 at 16:09

answered Jan 14, 2016 at 16:01

1 Comment

rbp Over a year ago

the regex fu is strong in this one
