Data looks like:

text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)
txt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)
text text txt txt c 222.30.233.121 -> 44.233.111.111
txt text txt text d 22.111.22.222 -> 153.33.233.111

I want to capture a, b, and c along with the two IPs on that line. I do not want the numbers in parentheses that are attached to some of the IPs.

I want my output to look something like this:

a 111.222.222.111 22.222.111.111
b 22.111.22.222 153.33.233.111
c 222.30.233.121 44.233.111.111

What the code looks like:

import gzip
import re

f=gzip.open(path+Fname,'rb')
for line in f:
    IP_info=re.findall( r'(a|b|c)\s+([0-9]+(?:\.[0-9]+){3})+[ -> ]+([0-9]+(?:\.[0-9]+){3})', line )
    print IP_info
f.close()

What my output actually looks like:

[('a', '111.222.222.111', '2.222.111.111')]
[('b', '22.111.22.222', '3.33.233.111')]

The two biggest problems I'm having:

1) The second IP in the output is incomplete: its leading digits have been truncated.

2) I am not capturing information for "c".

  • This is what I use to test my regexes: regex101.com (commented Jan 14, 2016 at 16:01)

1 Answer

Here is a regex that you can use:

\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})

There are several points of interest here:

  • I replaced your [ -> ]+ with " +-> +" (one or more spaces, a literal ->, then one or more spaces), since you meant to match a sequence of characters, not any single character from a set in arbitrary order. Inside the character class, the hyphen between the space and > creates a range from space to >, and that range covers punctuation such as ( ) and . as well as all digits. That is why the leading digits of your second IP were "eaten".
  • Since some IPs are followed by a number in parentheses, I added an optional non-capturing group (?:\(\d+\))? after the first IP.
  • Your first capturing group did not match d. I turned that group into a character class because the samples use single letters; if these are placeholders for longer strings, revert to a group with alternatives: (a|b|c|d).
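To make the first point concrete, here is a minimal sketch that checks which characters the original class [ -> ] accepts and compares the second IP captured by the loose and the strict separator. The names broad_class, samples, ip, loose and strict are made up for this illustration; the line is taken from the sample data in the question.

```python
import re

# The class from the question: a space, the range from ' ' to '>', and a space.
broad_class = re.compile(r'[ -> ]')

# Characters that occur between and around the IPs in the sample data.
samples = [' ', '-', '>', '(', ')', '.', '0', '9', 'a']
matched = [ch for ch in samples if broad_class.fullmatch(ch)]
print(matched)  # [' ', '-', '>', '(', ')', '.', '0', '9']

ip = r'([0-9]+(?:\.[0-9]+){3})'
line = 'text a 111.222.222.111(123) -> 22.222.111.111(7895)'

# Loose separator: the class greedily consumes digits and dots of the second IP.
loose = re.search(ip + r'[ -> ]+' + ip, line)
# Strict separator: optional "(digits)", then spaces, a literal "->", spaces.
strict = re.search(ip + r'(?:\(\d+\))? +-> +' + ip, line)

print(loose.group(2))   # 2.222.111.111
print(strict.group(2))  # 22.222.111.111
```

The loose version still finds a match because the regex engine backtracks only as far as it must to leave a valid dotted quad, which is why just the leading digit survives.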

Here is a Python demo:

import re
p = re.compile(r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})')
test_str = "text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)\ntxt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)\ntext text txt txt c 222.30.233.121 -> 44.233.111.111\ntxt text txt text d 22.111.22.222 -> 153.33.233.111"
for x in test_str.split("\n"):
    print(re.findall(p, x))

Output:

[('a', '111.222.222.111', '22.222.111.111')]
[('b', '22.111.22.222', '153.33.233.111')]
[('c', '222.30.233.121', '44.233.111.111')]
[('d', '22.111.22.222', '153.33.233.111')]
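To get from these tuples to the space-separated output asked for in the question, something like the following sketch may help. It assumes Python 3; format_matches and format_gzip_file are hypothetical helper names, and the filename passed to format_gzip_file is whatever path + Fname points to in the original code. Opening the archive in 'rt' mode is an assumption that the file is plain text in the default encoding.

```python
import gzip
import re

IP_LINE = re.compile(
    r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})'
)

def format_matches(lines):
    """Yield 'label ip1 ip2' for every line that matches IP_LINE."""
    for line in lines:
        m = IP_LINE.search(line)
        if m:
            yield ' '.join(m.groups())

def format_gzip_file(filename):
    # 'rt' opens the archive in text mode, so each line is a str, not bytes.
    with gzip.open(filename, 'rt') as f:
        return list(format_matches(f))

sample = [
    'text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)',
    'txt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)',
    'text text txt txt c 222.30.233.121 -> 44.233.111.111',
    'txt text txt text d 22.111.22.222 -> 153.33.233.111',
]
for row in format_matches(sample):
    print(row)
```

The with block closes the file automatically, so no explicit close() call is needed. If the d lines are not wanted, narrowing the character class to [abc] should drop them.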
1 Comment

the regex fu is strong in this one
