I am trying to extract matching substrings from a string using regex in Python. The string looks like this:

band   1 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996

band   2 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996

band   3 # energy  -53.15719532 # occ.  2.00000000

My goal is to capture the rest of each line that starts with tot, so the result should be something like:

['0.000  0.996  0.000  0.996', 
'0.000  0.996  0.000  0.996']

Here is my current code:

pattern = re.compile(r'tot\s+(.*?)\n', re.DOTALL)
pattern.findall(string)

However, the output gives me:

['1  0.000  0.995  0.000  0.995',
 '0.000  0.996  0.000  0.996',
 '1  0.000  0.995  0.000  0.995',
 '0.000  0.996  0.000  0.996']

Any idea of what I am doing wrong?

3 Answers

4

You don't want the DOTALL flag. Remove it and use MULTILINE instead.

pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)

This matches all lines that start with tot. The rest of the line will be in group 1.
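As a quick check, here is a minimal sketch that runs this pattern against the sample data from the question. The variable names `text` and `totals` are only illustrative, and the `strip()` call is an addition to remove the whitespace that group 1 keeps after tot.

```python
import re

# Sample data copied from the question (first two bands only).
text = """band   1 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996

band   2 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996
"""

pattern = re.compile(r'^\s*tot(.*)', re.MULTILINE)

# Group 1 still contains the spaces that follow "tot", so strip them.
totals = [rest.strip() for rest in pattern.findall(text)]
print(totals)
# ['0.000  0.996  0.000  0.996', '0.000  0.996  0.000  0.996']
```

The header line "ion s p d tot" is not matched because tot is not at the start of that line.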

Citing the documentation:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
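To make the difference between the two flags concrete, here is a small sketch on a made-up two-line string (the string and variable names are assumptions for illustration only). DOTALL changes what "." matches; MULTILINE changes where "^" and "$" match.

```python
import re

text = "tot\nabc"

# Default: "." does not match a newline, so the pattern cannot span the line break.
dot_default = re.findall(r"tot.abc", text)
# DOTALL: "." also matches the newline.
dot_all = re.findall(r"tot.abc", text, re.DOTALL)

# Default: "^" only matches at the very start of the string.
caret_default = re.findall(r"^abc", text)
# MULTILINE: "^" matches at the start of every line.
caret_multiline = re.findall(r"^abc", text, re.MULTILINE)

print(dot_default, dot_all)            # [] ['tot\nabc']
print(caret_default, caret_multiline)  # [] ['abc']
```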

Note that you can easily do this without regex.

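with open("input.txt", "r") as data_file:
    for line in data_file:
        # split() with no argument splits on any whitespace and drops empty strings
        items = line.split()
        # blank lines give an empty list, so check it before indexing
        if items and items[0] == "tot":
            values = items[1:]  # the numbers after "tot"
            # etc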
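For a self-contained version of the same idea, here is a sketch that works on an in-memory string instead of a file. The names `text` and `totals`, and the conversion to float, are assumptions for illustration rather than part of the original answer.

```python
# Sample data copied from the question (one band is enough to show the idea).
text = """band   1 # energy  -53.15719532 # occ.  2.00000000

ion      s      p      d    tot
  1  0.000  0.995  0.000  0.995
  2  0.000  0.000  0.000  0.000
tot  0.000  0.996  0.000  0.996
"""

totals = []
for line in text.splitlines():
    items = line.split()  # split on any run of whitespace
    # Only lines whose first field is "tot" are kept; the header starts with "ion".
    if items and items[0] == "tot":
        totals.append([float(value) for value in items[1:]])

print(totals)
# [[0.0, 0.996, 0.0, 0.996]]
```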

1 Comment

This solves my problem. I think I was confusing DOTALL and MULTILINE. I need to read more about them.
1

An alternative solution using the re.findall function with a more specific regex pattern:

# string is your initial string (named string rather than str to avoid shadowing the built-in)
result = re.findall(r'tot [0-9 .]+(?=\n|$)', string)
print(result)

The output:

['tot  0.000  0.996  0.000  0.996', 'tot  0.000  0.996  0.000  0.996']


1

You are using re.DOTALL, which means that the dot "." will match anything, even newlines, in essence finding both "tot"-s and everything that follows until the next newline:

                            tot
  1  0.000  0.995  0.000  0.995

and

tot  0.000  0.996  0.000  0.996

Removing re.DOTALL should fix your problem.

Edit: Actually, the DOTALL flag is not really the issue (though it is unnecessary). The problem in the pattern is that \s+ also matches the newline after the tot column header, so the match runs on into the next line. Replacing it with a single space solves that issue:

pattern = re.compile(r'tot (.*?)\n')
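Because the sample data has two spaces after tot, the single-space pattern leaves one leading space in the captured group. The sketch below compares the three variants on a shortened sample; the `[ \t]+` version is a suggested alternative rather than part of the original answer, and the variable names are illustrative.

```python
import re

# Shortened sample: header line, one ion line, and the "tot" line.
text = (
    "ion      s      p      d    tot\n"
    "  1  0.000  0.995  0.000  0.995\n"
    "tot  0.000  0.996  0.000  0.996\n"
)

# Original pattern: \s+ consumes the newline after the header's "tot".
original = re.findall(r'tot\s+(.*?)\n', text)

# Single space: the header no longer matches, but one space stays in the group.
single_space = re.findall(r'tot (.*?)\n', text)

# [ \t]+ matches only spaces and tabs, never a newline.
horizontal = re.findall(r'tot[ \t]+(.*)', text)

print(original)      # ['1  0.000  0.995  0.000  0.995', '0.000  0.996  0.000  0.996']
print(single_space)  # [' 0.000  0.996  0.000  0.996']
print(horizontal)    # ['0.000  0.996  0.000  0.996']
```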

3 Comments

I think I should change DOTALL to MULTILINE as @Tomalak suggested
MULTILINE is not needed here, unless you would want to make use of ^ and $ to match beginning and end of a line, respectively. I have to point out that @Tomalak's solution is cleaner.
You're right. \s+ is actually the problem here. I thought it only matched one or more spaces. Thanks for letting me know.
