How to use regex to delete specific pattern in lines in python string?

Question

I have a string in the following format.

Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.

How to delete Test 2, Test 3 and so on, so the string would look like this?

Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.

I have tried:

test1 = re.compile(r'^Test \d ')
test2 = re.compile(r'^Test \d\d ')
text = re.sub(test1, '', text)
text = re.sub(test2, '', text)

But it didn't work.

What do you mean by "it didn't work"? At least for a single line, text = re.sub(test1, '', text) does what you want.
– mkrieger1, Commented Jun 14, 2021 at 18:11

mkrieger1 · Accepted Answer · 2021-06-14 18:31:48Z

Assuming that you have a single multi-line string, then

test1 = re.compile(r'^Test \d ')
text = re.sub(test1, '', text)

does in fact remove Test 2 from the first line of the string, but does not change all other lines, because ^ matches the beginning of the whole string, and not the beginning of each line.
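To see this behaviour concretely, here is a minimal sketch. The shortened two-line sample string and the variable names are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical two-line sample, shortened from the question's text.
text = "Test 2 Lorem ipsum\nTest 3 Lorem ipsum\n"

# Without re.M, '^' only matches at the very start of the string.
pattern = re.compile(r'^Test \d ')
result = pattern.sub('', text)

print(repr(result))  # 'Lorem ipsum\nTest 3 Lorem ipsum\n'
```

Only the first prefix is removed; the second line still starts with Test 3.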

You can change that by using the re.M flag:

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line

>>> test1 = re.compile(r'^Test \d ', flags=re.M)
>>> text = '''\
... Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... '''
>>> print(re.sub(test1, '', text))
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.

Alternatively, split the string into lines and apply your original pattern without re.M to each line separately:

>>> test1 = re.compile(r'^Test \d ')
>>> [re.sub(test1, '', line) for line in text.splitlines()]
['Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.']

Depending on whether you want to continue processing the text as a whole, or each line separately (or maybe you already have each line separately as input to your program), one or the other option may be more practical.

The test1 pattern works only for single-digit numbers after 'Test ' and the test2 pattern works only for two-digit numbers. To make it work for any number of digits, change \d or \d\d to \d+.
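Putting these suggestions together, a single pattern with \d+ can replace both test1 and test2. The following is only a sketch: the sample text and the names prefix, line_prefix, cleaned and rejoined are assumptions chosen for illustration.

```python
import re

# Hypothetical sample mixing one-, two- and three-digit numbers.
text = "Test 2 Lorem ipsum\nTest 10 dolor sit\nTest 123 amet\n"

# \d+ accepts one or more digits; re.M lets '^' match at each line start.
prefix = re.compile(r'^Test \d+ ', flags=re.M)
cleaned = prefix.sub('', text)
print(cleaned)

# Line-by-line variant: no re.M needed, because every line is its own string.
line_prefix = re.compile(r'^Test \d+ ')
rejoined = '\n'.join(line_prefix.sub('', line) for line in text.splitlines()) + '\n'
```

Both variants produce the same text here; the second one is mainly useful if the lines are processed individually anyway.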

RavinderSingh13 · Accepted Answer · 2021-06-17 16:27:51Z


Based on your shown samples, please try the following. It also works when a line starts with one or more occurrences of Test followed by digits.

import re
var="""Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit."""
print(re.sub(r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*', '', var, flags=re.M))

Explanation: This uses the re.sub function from Python's re module to replace every match of the regex in var (the variable) with an empty string.

Explanation of regex:

^(Test\s+\d+)       ## At the start of each line (because of re.M), match Test followed by 1 or more whitespace characters and 1 or more digits.
(\s+Test\s+\d+)*    ## Match 1 or more whitespace characters, then Test, then 1 or more whitespace characters, then 1 or more digits; the whole group may occur 0 or more times.
\s*                 ## Match 0 or more trailing whitespace characters.
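As a rough illustration of what the optional group adds, the sketch below applies the same pattern to a line with a repeated prefix. The input strings are hypothetical and do not come from the question. It also shows one side effect worth knowing: \s matches newlines as well as spaces.

```python
import re

pattern = re.compile(r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*', flags=re.M)

# Hypothetical line with a repeated prefix.
repeated = "Test 2 Test 3 Lorem ipsum"
print(pattern.sub('', repeated))  # Lorem ipsum

# Caveat: \s also matches newlines, so a line consisting only of a
# prefix is merged with the line that follows it.
merged = pattern.sub('', "Test 2\nTest 3 Lorem ipsum")
print(repr(merged))  # 'Lorem ipsum'
```

If line breaks must be preserved in such cases, a literal space or [ \t] could be used in place of \s.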

edited Jun 17, 2021 at 16:27

answered Jun 14, 2021 at 18:02


4 Comments

mkrieger1 Over a year ago

Why do you use Test\s+\d+ twice in a row?

RavinderSingh13 Over a year ago

@mkrieger1, 1st one is mandatory match, 2nd one is optional in case OP has 1 or more occurrences of Test digit to catch them.

Arvind Kumar Avinash Over a year ago

Good answer but I see a problem here. You have done the same thing in two different ways which may be confusing for a beginner. {0,} is the same as * and therefore you should use the same thing at both places to keep your pattern consistent.

RavinderSingh13 Over a year ago

@ArvindKumarAvinash, sure thank you, even I also thought that(before posting) but I thought to keep it this way, yes one could use * too, cheers.
