Remove multiple substrings inside a string

Question

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

I want this result:

abcdef ghijk lmnop qrs tuv wxyz 0123456789

Having reviewed numerous questions and answers here, the closest I have come to a solution is:

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789

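Because not every double-bracketed substring contains a pipe, the first re.sub removes everything from '[[' to the next '|', even when that pipe belongs to a later set of brackets. That is how '[[qrs]] tuv' gets lost. I need the match to stay within each set of double brackets.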
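One way to see the problem is to list what the first pattern actually matches. This is only a minimal diagnostic sketch on the same string; the variable name `matches` is illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# List every span that the first substitution would delete.
matches = re.findall(r'\[\[.*?\|', s)
print(matches)
# ['[[xxxx xxx|', '[[qrs]] tuv [[xx xxxx|']
```

The second match starts at '[[qrs' and runs on to the pipe inside the next pair of brackets, taking 'qrs' and 'tuv' with it.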

Any assistance would be most appreciated.

Can you please describe the discriminating factor, i.e., why those particular substrings?
– TigerhawkT3, Commented Jun 30, 2015 at 21:35
Try using raw strings for your regular expressions, like r'\[\[.*?\|'.
– Spice, Commented Jun 30, 2015 at 21:36
@TigerhawkT3 In my text, whenever a pipe occurs between a set of double brackets, everything before the pipe is a description I don't need in the final result.
– Pickle, Commented Jun 30, 2015 at 22:05

Sede · Accepted Answer · 2015-06-30 22:19:26Z

1

What about this:

In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'

edited Jun 30, 2015 at 22:19

answered Jun 30, 2015 at 21:40

Sede


1 Comment

Pickle Over a year ago

Should have specified that the x's between '[[' and the '|' represent a substring unknown to me. How can I alter your suggested regex to include any character?
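One possible adaptation, offered as a sketch rather than the answerer's own fix: replace `\w+\s+\w+` with a character class that accepts anything except a pipe or a closing bracket, and escape the pipe in the lookahead. The name `cleaned` is illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# Alternative 1 removes every '[', '|' and ']'.
# Alternative 2 removes a run of characters that directly follows '['
# and ends right before a '|', provided it contains no '|' or ']'.
cleaned = re.sub(r'[\[|\]]|(?<=\[)[^|\]]+(?=\|)', '', s)
print(cleaned)
# abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Note that this version also strips any single '[', ']' or '|' that appears outside double brackets, so it assumes those characters occur only as markup.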

fronthem · Accepted Answer · 2015-06-30 22:37:56Z

1

I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.

There are two patterns you wish to catch:

Case: words outside [[...]]

Pattern: any word that is preceded by ']] ' or followed by ' [['.

Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)

Case: words inside [[...]]

Pattern: any word that is followed by ']]'.

Regex: \w+(?=\]\])

Example code

#!/usr/bin/env python3
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# Three alternatives: a word after ']] ', a word before ' [[', a word before ']]'.
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))

Result:

['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
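Since the question asks for a single string rather than a list, the matches can be joined back together. A minimal sketch, assuming single spaces between words are acceptable; the name `result` is illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')

# findall returns the kept words in order; join them into one string.
result = ' '.join(p.findall(s))
print(result)
# abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

These patterns only pick up outside words that sit directly next to a bracket pair, so text with several consecutive plain words would need a broader pattern.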

edited Jun 30, 2015 at 22:37

answered Jun 30, 2015 at 21:57

fronthem


TigerhawkT3 · Accepted Answer · 2015-06-30 22:26:59Z

0

>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'

This searches for and removes the following two items:

Two opening brackets, followed by one or more characters that are not a closing bracket, followed by a pipe.
Any remaining opening or closing bracket.
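A variation on the same idea, shown only as a sketch and not as part of the tested answer above: match each whole [[...]] unit once and keep the wanted text through a capture group. The names `pattern` and `result` are illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# \[\[            opening double bracket
# (?:[^\]|]*\|)?  optional description, up to and including the pipe
# ([^\]]*)        the text to keep (group 1)
# \]\]            closing double bracket
pattern = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

# Replace each bracketed unit with just its captured text.
result = pattern.sub(r'\1', s)
print(result)
# abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Because the pattern requires both '[[' and ']]', a stray single bracket elsewhere in the text is left untouched.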

answered Jun 30, 2015 at 22:26

TigerhawkT3


1 Comment

Pickle Over a year ago

Just ran your solution through a sample of my file and it worked with every entry. This comes after a lot of frustration on the issue. Many thanks!

Kasravnd · Accepted Answer · 2015-06-30 22:30:15Z

0

As a general approach with the built-in re module, you can use the following regex, which relies on look-arounds:

(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])

You can use re.finditer to get the desired result:

>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> g = re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])', s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']

The regex consists of two parts. The first is:

(?<!\[\[)\b([\w]+)\b(?!\|)

which matches any run of word characters that is not followed by | and not preceded by [[.

The second part is:

(?<=\[\[)[^|]*(?=\]\])

which matches anything between double brackets that contains no |.

edited Jun 30, 2015 at 22:30

answered Jun 30, 2015 at 22:04

Kasravnd
