Python best way to remove multiple strings from string

Question

Python 3.6

I'd like to remove a list of strings from a string. Here is my first poor attempt:

string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
print(result)

output:

['test']

But this doesn't work if the words in the string aren't neatly separated by single spaces. I feel there must be a built-in solution. There must be a better way!

I've had a look at this discussion on Stack Overflow, which asks exactly the same question as mine...

So as not to waste my efforts, I timed all the solutions. I believe the easiest, fastest and most Pythonic is the simple for loop, which was not the conclusion in the other post...

result = string
for i in items_to_remove:
    result = result.replace(i,'')

Test Code:

import timeit

t1 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
''', number=1000000)
print(t1)

t2 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
def sub(m):
    return '' if m.group() in items_to_remove else m.group()

result = re.sub(r'\w+', sub, string)
''',setup= 'import re', number=1000000)
print(t2)

t3 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = re.sub(r'|'.join(items_to_remove), '', string)
''',setup= 'import re', number=1000000)
print(t3)

t4 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = string
for i in items_to_remove:
    result = result.replace(i,'')
''', number=1000000)
print(t4)

outputs:

1.9832003884248448
4.408749988641971
2.124719851741177
1.085117268194475
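
A caveat on reading these numbers: each statement above rebuilds string and items_to_remove on every iteration, and the regex versions go through the re module's pattern cache each time. The following is only a sketch of how the fixed inputs could be moved into setup so that the removal itself is what gets timed. The names blocked, pattern, candidates and timings are illustrative and not part of the original test code, and the iteration count is reduced to keep the run short. Note also that the candidates do not return the same thing (a list of words versus a string), so the timings are only roughly comparable.

```python
import timeit

# Shared setup: build the inputs once so only the removal itself is timed.
setup = '''
import re
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
blocked = set(items_to_remove)
pattern = re.compile('|'.join(map(re.escape, items_to_remove)))
'''

# Illustrative candidates; each value is a statement passed to timeit.
candidates = {
    'list comprehension + set': "[x for x in string.split() if x not in blocked]",
    'compiled regex': "pattern.sub('', string)",
    'replace loop': '''
result = string
for i in items_to_remove:
    result = result.replace(i, '')
''',
}

timings = {
    name: timeit.timeit(stmt, setup=setup, number=100000)
    for name, stmt in candidates.items()
}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.3f}s')
```

Actual figures will vary by machine and Python version, so none are quoted here.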

There is a difference between those solutions: some will take into account full words, while others, like the for loop, will replace substrings as well. Try changing the order of your items_to_remove to ['is', 'this', 'a', 'string'] and you'll see what I'm talking about.
– zwer, Commented Jul 22, 2017 at 13:32
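
To make the point in the comment concrete, here is a small sketch comparing the replace loop with a whole-word regex. The helper names remove_substrings and remove_whole_words are made up for illustration; the second one assumes that "whole word" means bounded by \b on both sides, which may not suit every input.

```python
import re

string = 'this is a test string'

def remove_substrings(text, items):
    # str.replace removes every occurrence, even inside longer words.
    for item in items:
        text = text.replace(item, '')
    return text

def remove_whole_words(text, items):
    # Hypothetical helper: \b restricts matches to whole words, and
    # re.escape guards against regex metacharacters in the items.
    pattern = r'\b(?:' + '|'.join(map(re.escape, items)) + r')\b'
    return re.sub(pattern, '', text)

print(repr(remove_substrings(string, ['this', 'is', 'a', 'string'])))   # '   test '
print(repr(remove_substrings(string, ['is', 'this', 'a', 'string'])))   # 'th   test '
print(repr(remove_whole_words(string, ['is', 'this', 'a', 'string'])))  # '   test '
```

With 'is' first, the loop strips the 'is' out of 'this' and leaves 'th' behind, so the result depends on the order of the list. The word-boundary version gives the same result for either order, though it still leaves the surrounding spaces in place.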

cs95 · Accepted Answer · 2017-07-22 13:45:56Z

You can use string.split() if you aren't confident of your string spacing.

string.split() and string.split(' ') work a little differently:

In [128]: 'this     is   a test'.split()
Out[128]: ['this', 'is', 'a', 'test']

In [129]: 'this     is   a test'.split(' ')
Out[129]: ['this', '', '', '', '', 'is', '', '', 'a', 'test']

The former splits your string without any redundant empty strings.

If you want a little more security, or if your strings could contain tabs and newlines, there's another solution with regex:

In [131]: re.split(r'\s+', 'this     is \t  a\ntest')
Out[131]: ['this', 'is', 'a', 'test']
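
As a side note, here is a short sketch of how str.split() with no argument and re.split differ. The variable names text and padded are only for illustration. Two things are worth knowing: split() with no argument already treats tabs and newlines as whitespace, and the third positional parameter of re.split is maxsplit, so flags should be passed by keyword.

```python
import re

text = 'this     is \t  a\ntest'

# str.split() with no argument splits on any run of whitespace,
# including tabs and newlines, so both calls agree here.
print(text.split())            # ['this', 'is', 'a', 'test']
print(re.split(r'\s+', text))  # ['this', 'is', 'a', 'test']

# They differ on leading/trailing whitespace: re.split keeps empty strings.
padded = '  this is  '
print(padded.split())            # ['this', 'is']
print(re.split(r'\s+', padded))  # ['', 'this', 'is', '']

# The third positional slot of re.split is maxsplit, not flags.
print(re.split(r'\s+', text, maxsplit=1))  # ['this', 'is \t  a\ntest']
```

The regex route is mainly useful when the delimiter is something other than plain whitespace, for example punctuation mixed with spaces.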

Lastly, I would suggest converting your lookup list into a lookup set for efficient lookup in your filter:

In [135]: list(filter(lambda x: x not in {'is', 'this', 'a', 'string'}, string.split()))
Out[135]: ['test']

While on the topic of performance, a list comp is a bit faster than a filter, although less concise:

In [136]: [x for x in string.split() if x not in {'is', 'this', 'a', 'string'}]
Out[136]: ['test']
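
Putting those pieces together, a minimal sketch of a reusable function might look like the following. The name remove_words is hypothetical, and joining the surviving words back with single spaces is an assumption about the desired output, since the snippets above stop at a list.

```python
def remove_words(text, items_to_remove):
    # Hypothetical helper: build the set once, outside the comprehension,
    # so each membership test is an average-case O(1) hash lookup.
    blocked = set(items_to_remove)
    kept = [word for word in text.split() if word not in blocked]
    return ' '.join(kept)

print(remove_words('this   is a\ttest string', ['this', 'is', 'a', 'string']))
# test
print(remove_words('this is a test string, honest', ['string']))
# this is a test string, honest
```

The second call shows a limitation of splitting on whitespace: 'string,' with its trailing comma is a different token from 'string', so it is not removed.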

edited Jul 22, 2017 at 13:45

answered Jul 22, 2017 at 13:35

cs95

1 Comment

James Schinner Over a year ago

That's gold, a couple of subtle things I didn't consider.
