Extract specific text using multiple regex in Python?

Question

I have a problem using regular expressions in Python 3, so I would be grateful if someone could help me. I have a text file like the one below:

Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end

What I would like is a list of the text between the headers, including the headers themselves. I am using this regular expression:

 re.findall(r'(?=(Header.*?Header|Header.*?end))',data, re.DOTALL)

The result is:

['Header A\ntext text\ntext text\nHeader', 'Header B\ntext text\ntext text\nHeader', 'Header C\ntext text\nhere is the end']

The problem is that every item in the list ends with the next header. Each section ends where the next header begins, but the last section has no header after it to mark its end.

Is there a way to get a list (not tuple) of every header including its own text as substrings using regular expressions?

vks · Accepted Answer · 2015-03-12 15:07:30Z

1

Header [^\n]*[\s\S]*?(?=Header|$)

Try this. See the demo:

https://regex101.com/r/iS6jF6/21

import re

p = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')
test_str = "Header A\ntext text\ntext text\nHeader B\ntext text\ntext text\nHeader C\ntext text\nhere is the end"

# Each match starts at a header and runs up to the next "Header" or the end of the string.
print(p.findall(test_str))
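
For anyone checking the result, here is a minimal sketch of what this pattern returns on the sample text. It assumes the file content is the question's sample joined with "\n". The lookahead stops each match right before the next "Header", so every item except the last keeps its trailing newline; the `rstrip` step and the names `raw` and `sections` are my own additions for illustration, not part of the answer.

```python
import re

# Sample text from the question, assumed to be newline-separated.
data = (
    "Header A\ntext text\ntext text\n"
    "Header B\ntext text\ntext text\n"
    "Header C\ntext text\nhere is the end"
)

pattern = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')

# Full matches: each one runs from a header up to the next "Header" or the end.
raw = pattern.findall(data)

# Optional cleanup: drop the newline left in front of the next header.
sections = [chunk.rstrip("\n") for chunk in raw]

print(raw)
print(sections)
```

Since the pattern has no capturing group, `findall` returns a plain list of strings rather than tuples, which is what the question asked for.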

answered Mar 12, 2015 at 15:07


1 Comment

bettas Over a year ago

Yes, this is correct too! Thank you, vks. As I said in another comment, I think I have to read up on regex again to understand how they work!

Toto · Accepted Answer · 2015-03-12 15:07:53Z

1

How about:

re.findall(r'(?=(Header.*?)(?=Header|end))',data, re.DOTALL)
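
A quick check of this pattern on the sample text, as a sketch only. The inner lookahead stops the capture before the literal word "end", so the last item comes back without it. Swapping `end` for `\Z` (end of string) is one possible adjustment; that variant is my assumption about the intended result and is not part of the answer above.

```python
import re

# Sample text from the question, assumed to be newline-separated.
data = (
    "Header A\ntext text\ntext text\n"
    "Header B\ntext text\ntext text\n"
    "Header C\ntext text\nhere is the end"
)

# Pattern as posted: the capture stops before "Header" or before "end".
as_posted = re.findall(r'(?=(Header.*?)(?=Header|end))', data, re.DOTALL)

# Assumed variant: stop before the next "Header" or at the end of the string.
to_end = re.findall(r'(?=(Header.*?)(?=Header|\Z))', data, re.DOTALL)

print(as_posted[-1])  # last item, without the word "end"
print(to_end[-1])     # last item, running to the end of the text
```

Because the pattern has exactly one capturing group, `findall` returns a list of strings here, not a list of tuples.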

answered Mar 12, 2015 at 15:07


Avinash Raj · Accepted Answer · 2015-03-12 15:44:18Z

1

You actually need to use a positive lookahead assertion.

>>> import re
>>> s = '''Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end'''
>>> re.findall(r'Header.*?(?=Header)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text\n', 'Header B\ntext text\ntext text\n', 'Header C\ntext text\nhere is the end']

Include \n inside the positive lookahead so that each item does not end with a trailing \n character.

>>> re.findall(r'Header.*?(?=\nHeader)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']

OR

Split your input on the newline that comes just before the string Header.

>>> re.split(r'\n(?=Header\b)', s)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']
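
If the text comes from a file, the split approach could be wrapped up roughly like this. It is a sketch under assumptions: `split_sections` and its `header` parameter are hypothetical names of mine, and `io.StringIO` stands in for the real file, which the question does not name.

```python
import io
import re


def split_sections(text, header="Header"):
    """Split text at each newline that is immediately followed by `header`."""
    # The lookahead keeps the header itself in the following chunk.
    pattern = r'\n(?=' + re.escape(header) + r'\b)'
    # Strip outer newlines so a trailing newline does not end up in the last chunk.
    return re.split(pattern, text.strip("\n"))


# Stand-in for open(...).read(); real files usually end with a newline.
fake_file = io.StringIO(
    "Header A\ntext text\ntext text\n"
    "Header B\ntext text\ntext text\n"
    "Header C\ntext text\nhere is the end\n"
)

sections = split_sections(fake_file.read())
print(sections)
```

`re.split` always returns a list of strings, and with this pattern no piece contains the next header.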

edited Mar 12, 2015 at 15:44

answered Mar 12, 2015 at 15:08


1 Comment

bettas Over a year ago

Thanks, Avinash! That's correct too! I think I have to read more about regex to understand how they work! Thanks a lot!
