Python 3.6

I'd like to remove a list of strings from a string. Here is my first poor attempt:

string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
print(result)

output:

['test']

But this doesn't work if the string isn't nicely spaced. I feel there must be a builtin solution. There must be a better way!

I've had a look at this discussion on Stack Overflow, which asks exactly the same question as mine.

So as not to waste my efforts, I timed all the solutions. I believe the easiest, fastest and most Pythonic is the simple for loop, which was not the conclusion in the other post.

result = string
for i in items_to_remove:
    result = result.replace(i,'')

Test Code:

import timeit

t1 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
''', number=1000000)
print(t1)

t2 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
def sub(m):
    return '' if m.group() in items_to_remove else m.group()

result = re.sub(r'\w+', sub, string)
''',setup= 'import re', number=1000000)
print(t2)

t3 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = re.sub(r'|'.join(items_to_remove), '', string)
''',setup= 'import re', number=1000000)
print(t3)

t4 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = string
for i in items_to_remove:
    result = result.replace(i,'')
''', number=1000000)
print(t4)

outputs (seconds, in the order t1 to t4):

1.9832003884248448
4.408749988641971
2.124719851741177
1.085117268194475
  • There is a difference between those solutions: some take whole words into account, while others, like the for loop, replace substrings as well. Try changing the order of your items_to_remove to ['is', 'this', 'a', 'string'] and you'll see what I'm talking about. Commented Jul 22, 2017 at 13:32
  • Oh, that's a great point! Commented Jul 22, 2017 at 13:38

1 Answer

You can use string.split() if you aren't confident of your string spacing.

string.split() and string.split(' ') work a little differently:

In [128]: 'this     is   a test'.split()
Out[128]: ['this', 'is', 'a', 'test']

In [129]: 'this     is   a test'.split(' ')
Out[129]: ['this', '', '', '', '', 'is', '', '', 'a', 'test']

The former splits your string without any redundant empty strings.

If you want a little more security, or if your strings could contain tabs and newlines, there's another solution with regex:

In [131]: re.split(r'\s+', 'this     is \t  a\ntest')
Out[131]: ['this', 'is', 'a', 'test']
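
Splitting on whitespace still leaves punctuation attached to words, and a plain str.replace loop removes substrings, as pointed out in the comments on the question. One possible way around both is a single regex with word boundaries. This is only a sketch under assumptions: remove_whole_words is a hypothetical helper name (not a builtin), and "word" here means whatever \b treats as a word.

```python
import re


def remove_whole_words(text, words_to_remove):
    """Delete whole-word occurrences of words_to_remove from text."""
    # re.escape guards against regex metacharacters in the words;
    # the \b anchors ensure 'is' does not match inside 'this'.
    pattern = r'\b(?:' + '|'.join(map(re.escape, words_to_remove)) + r')\b'
    stripped = re.sub(pattern, '', text)
    # Collapse the whitespace left behind by the removed words.
    return ' '.join(stripped.split())


words = ['is', 'this', 'a', 'string']

# Plain str.replace works on substrings, so 'is' also eats part of 'this'.
naive = 'this is a test string'
for word in words:
    naive = naive.replace(word, '')
print(repr(naive))

# The word-boundary version only removes complete words.
print(repr(remove_whole_words('this is a test string', words)))
```

The order of the words no longer matters here, because a word is only removed when it is delimited on both sides.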

Lastly, I would suggest converting your lookup list into a lookup set for efficient lookup in your filter:

In [135]: list(filter(lambda x: x not in {'is', 'this', 'a', 'string'}, string.split()))
Out[135]: ['test']

While on the topic of performance, a list comprehension is a bit faster than a filter, although less concise:

In [136]: [x for x in string.split() if x not in {'is', 'this', 'a', 'string'}]
Out[136]: ['test']
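
If you want a string back rather than a list, the pieces above can be combined into a small helper. This is a minimal sketch: remove_words is a hypothetical name, and it assumes words are separated by whitespace only, so punctuation attached to a word will stop it from matching.

```python
def remove_words(text, words_to_remove):
    """Drop whitespace-separated tokens of text that appear in words_to_remove."""
    # Build the set once so each membership test is a hash lookup.
    blocked = set(words_to_remove)
    kept = [token for token in text.split() if token not in blocked]
    # Re-joining with single spaces normalises the original spacing.
    return ' '.join(kept)


print(remove_words('this   is a \t test string', ['is', 'this', 'a', 'string']))
```

Because it compares whole tokens, a word such as 'thistle' is left alone even when 'this' is in the removal list.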

1 Comment

That's gold, a couple of subtle things I didn't consider.
