3

I have a string in the following format.

Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.

How can I delete Test 2, Test 3 and so on, so that the string looks like this?

Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.

I have tried:

import re

test1 = re.compile(r'^Test \d ')
test2 = re.compile(r'^Test \d\d ')
text = re.sub(test1, '', text)
text = re.sub(test2, '', text)

But it didn't work.

2
  • re.sub(r'^Test\s+\d+\s*', '', text, flags=re.M)? Commented Jun 14, 2021 at 17:58
  • What do you mean by "it didn't work"? At least for a single line, text = re.sub(test1, '', text) does what you want. Commented Jun 14, 2021 at 18:11

2 Answers

3

Assuming that you have a single multi-line string, then

test1 = re.compile(r'^Test \d ')
text = re.sub(test1, '', text)

does in fact remove Test 2 from the first line of the string, but leaves all other lines unchanged, because by default ^ matches only at the beginning of the whole string, not at the beginning of each line.
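One way to check this yourself is re.subn, which returns the number of replacements together with the new string. The following is only a minimal sketch, assuming a shortened two-line sample string; the names result and count are illustrative.

```python
import re

# Shortened two-line sample; the content is illustrative only.
text = "Test 2 Lorem ipsum\nTest 3 Lorem ipsum\n"

# Without re.M, ^ matches only at the very start of the string,
# so at most one replacement can happen here.
result, count = re.subn(r'^Test \d ', '', text)

print(count)         # number of replacements made
print(repr(result))  # the second line still starts with "Test 3 "
```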

You can change that by using the re.M flag:

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line

>>> import re
>>> test1 = re.compile(r'^Test \d ', flags=re.M)
>>> text = '''\
... Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... '''
>>> print(re.sub(test1, '', text))
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.

Alternatively, split the string into lines and apply your original pattern without re.M to each line separately:

>>> test1 = re.compile(r'^Test \d ')
>>> [re.sub(test1, '', line) for line in text.splitlines()]
['Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
 'Lorem ipsum dolor sit amet consectetur adipisicing elit.']

Depending on whether you want to continue processing the text as a whole, or each line separately (or maybe you already have each line separately as input to your program), one or the other option may be more practical.

The test1 pattern works only for single-digit numbers after 'Test ' and the test2 pattern works only for two-digit numbers. To make it work for any number of digits, change \d or \d\d to \d+.
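Putting both changes together, a sketch with \d+ and re.M might look like the following. The sample text with one-, two- and three-digit numbers is made up for illustration, and the names pattern and cleaned are not from the question.

```python
import re

# Made-up sample with one-, two- and three-digit numbers.
text = "Test 2 Lorem ipsum\nTest 10 dolor sit\nTest 123 amet\n"

# \d+ matches one or more digits; re.M lets ^ match at each line start.
pattern = re.compile(r'^Test \d+ ', flags=re.M)

cleaned = pattern.sub('', text)
print(cleaned)
```

With this, a single pattern covers what test1 and test2 were each trying to handle separately.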

3

Based on the samples you have shown, please try the following. It also works when a line starts with more than one occurrence of Test followed by digits.

import re

var = """Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit."""
# re.M makes ^ match at the start of every line, not just the start of the string.
print(re.sub(r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*', '', var, flags=re.M))

Explanation: this uses the re.sub function from Python's re library. The regex is passed to it so that every match in var (the variable) is replaced with an empty string.

Explanation of regex:

^(Test\s+\d+)       ##At the start of each line (because of re.M), match Test followed by 1 or more whitespace characters and then 1 or more digits.
(\s+Test\s+\d+)*    ##Match 1 or more whitespace characters, then Test, then 1 or more whitespace characters, then 1 or more digits; this whole group may occur 0 or more times.
\s*                 ##Match 0 or more trailing whitespace characters.
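To illustrate what the optional middle group is for, here is a small sketch with a hypothetical line that carries two Test prefixes in a row. The input string is invented for the example and does not come from the question.

```python
import re

# Hypothetical input: the first line has two "Test <n>" prefixes.
var = "Test 2 Test 7 Lorem ipsum\nTest 3 dolor sit"

pattern = r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*'

# Both prefixes on the first line are removed in a single match.
cleaned = re.sub(pattern, '', var, flags=re.M)
print(cleaned)
```

Keep in mind that \s also matches newline characters, so a line consisting only of Test and a number would let the match continue into the following line. If that matters for your data, a literal space or [ \t] in place of \s may be the safer choice.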

4 Comments

Why do you use Test\s+\d+ twice in a row?
@mkrieger1, the 1st one is a mandatory match; the 2nd one is optional, to catch the case where the OP has 1 or more further occurrences of Test digit.
Good answer but I see a problem here. You have done the same thing in two different ways which may be confusing for a beginner. {0,} is the same as * and therefore you should use the same thing at both places to keep your pattern consistent.
@ArvindKumarAvinash, sure, thank you. I thought about that too before posting but decided to keep it this way. Yes, one could use * too, cheers.
