python regex find matched string

Question

I am trying to find the matched string in a string using regex in Python. The string looks like this:

band   1 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996

band   2 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996

band   3 # energy  -53.15719532 # occ.  2.00000000

My goal is to find the string after tot on each line that starts with tot. So the matches should look something like:

['0.000  0.996  0.000  0.996', 
'0.000  0.996  0.000  0.996']

Here is my current code:

pattern = re.compile(r'tot\s+(.*?)\n', re.DOTALL)
pattern.findall(string)

However, the output gives me:

['1  0.000  0.995  0.000  0.995',
 '0.000  0.996  0.000  0.996',
 '1  0.000  0.995  0.000  0.995',
 '0.000  0.996  0.000  0.996']

Any idea of what I am doing wrong?

Tomalak · Accepted Answer · 2016-09-04 18:11:30Z

4

You don't want the DOTALL flag. Remove it and use MULTILINE instead.

pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)

This matches all lines that start with tot. The rest of the line will be in group 1.
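As a quick check, here is a minimal sketch that runs this pattern against a shortened copy of the sample data from the question. The variable names `text` and `rows` are only illustrative, and the `.strip()` call is an addition: group 1 keeps the whitespace that follows tot, so it is trimmed here to get the list the question asks for.

```python
import re

# Shortened copy of the sample data from the question.
text = (
    "band   1 # energy  -53.15719532 # occ.  2.00000000\n"
    "\n"
    "ion      s      p      d    tot\n"
    "  1  0.000  0.995  0.000  0.995\n"
    "  2  0.000  0.000  0.000  0.000\n"
    "tot  0.000  0.996  0.000  0.996\n"
    "\n"
    "band   2 # energy  -53.15719532 # occ.  2.00000000\n"
    "\n"
    "ion      s      p      d    tot\n"
    "  1  0.000  0.995  0.000  0.995\n"
    "  2  0.000  0.000  0.000  0.000\n"
    "tot  0.000  0.996  0.000  0.996\n"
)

# With MULTILINE, '^' matches at the start of every line, not only at the
# start of the whole string.
pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)

# Group 1 still carries the whitespace after "tot", so trim it.
rows = [rest.strip() for rest in pattern.findall(text)]
print(rows)
# ['0.000  0.996  0.000  0.996', '0.000  0.996  0.000  0.996']
```

The header line that ends in tot is skipped because tot is not at the start of that line.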

Citing the documentation, emphasis mine:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
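To see the flag's effect in isolation, a small sketch (the test string `two_lines` is made up for illustration):

```python
import re

two_lines = "tot\n1.0"

# Default: '.' does not match '\n', so the pattern cannot span both lines.
without_flag = re.findall(r'tot.1', two_lines)
print(without_flag)  # []

# With DOTALL: '.' also matches '\n', so the match crosses the line break.
with_flag = re.findall(r'tot.1', two_lines, re.DOTALL)
print(with_flag)     # ['tot\n1']
```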

Note that you can easily do this without regex.

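with open("input.txt", "r") as data_file:
    for line in data_file:
        # split() with no argument splits on runs of whitespace and drops empty strings
        items = line.split()
        if items and items[0] == "tot":
            print(items[1:])  # the values after "tot"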
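If the numbers themselves are needed rather than the raw text, the same idea could be wrapped in a small helper. This is only a sketch: the function name `tot_rows` and the conversion to `float` are assumptions, not part of the original answer.

```python
def tot_rows(lines):
    """Return the numeric columns of every line whose first field is 'tot'."""
    rows = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if fields and fields[0] == "tot":
            rows.append([float(value) for value in fields[1:]])
    return rows


# Works on any iterable of lines, for example an open file object.
sample = [
    "ion      s      p      d    tot\n",
    "  1  0.000  0.995  0.000  0.995\n",
    "tot  0.000  0.996  0.000  0.996\n",
]
parsed = tot_rows(sample)
print(parsed)  # [[0.0, 0.996, 0.0, 0.996]]
```

Checking `fields` for emptiness first keeps blank lines from raising an `IndexError`.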

edited Sep 4, 2016 at 18:11

answered Sep 4, 2016 at 18:02

Tomalak


1 Comment

Jianli Cheng Over a year ago

This solves my problem. I think I was confusing DOTALL and MULTILINE. I need to read more about them.

RomanPerekhrest · Answer · 2016-09-04 18:14:19Z

1

An alternative using re.findall, with a character class that only allows digits, spaces, and dots after tot:

import re

# text is your initial string
result = re.findall(r'tot [0-9 .]+(?=\n|$)', text)
print(result)

The output:

['tot  0.000  0.996  0.000  0.996', 'tot  0.000  0.996  0.000  0.996']

edited Sep 4, 2016 at 18:14

answered Sep 4, 2016 at 18:09

RomanPerekhrest


mpurg · Answer · 2016-09-04 18:16:41Z

1

You are using re.DOTALL, which means that the dot "." matches anything, including newlines. In essence, the pattern finds both occurrences of "tot" and everything that follows each one up to the next newline:

                            tot
  1  0.000  0.995  0.000  0.995

and

tot  0.000  0.996  0.000  0.996

Removing re.DOTALL should fix your problem.

Edit: Actually, the DOTALL flag is not really the issue (though unnecessary). The problem in the pattern is that the \s+ matches the newline. Replacing that with a single space solves that issue:

pattern = re.compile(r'tot (.*?)\n')
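The difference can be checked on a shortened copy of the sample. This is a sketch under one assumption: `[ \t]+` is used here instead of the single space above, a variation that also swallows the second space after tot so the captured group has no leading space.

```python
import re

# Shortened copy of the sample data from the question.
text = (
    "ion      s      p      d    tot\n"
    "  1  0.000  0.995  0.000  0.995\n"
    "tot  0.000  0.996  0.000  0.996\n"
)

# \s+ also matches '\n', so the "tot" at the end of the header line
# pulls in the following line, even without DOTALL.
with_s = re.findall(r'tot\s+(.*?)\n', text)
print(with_s)
# ['1  0.000  0.995  0.000  0.995', '0.000  0.996  0.000  0.996']

# [ \t]+ matches only spaces and tabs, so the header "tot" (followed
# directly by a newline) no longer matches.
spaces_only = re.findall(r'tot[ \t]+(.*)', text)
print(spaces_only)
# ['0.000  0.996  0.000  0.996']
```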

edited Sep 4, 2016 at 18:16

answered Sep 4, 2016 at 18:06

mpurg


3 Comments

Jianli Cheng Over a year ago

I think I should change DOTALL to MULTILINE as @Tomalak suggested

mpurg Over a year ago

MULTILINE is not needed here unless you want to use ^ and $ to match the beginning and end of each line, respectively. I have to point out that @Tomalak's solution is cleaner.

Jianli Cheng Over a year ago

You're right. \s+ is actually the problem here. I thought it only meant one or more spaces. Thanks for letting me know.
