1

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

I want this result:

abcdef ghijk lmnop qrs tuv wxyz 0123456789

Having reviewed numerous questions and answers here, the closest I have come to a solution is:

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789

Since not every substring within double brackets contains a pipe, the first re.sub removes everything from '[[' to the next '|', even when that pipe belongs to a later set of double brackets.
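
To make the failure concrete, it can help to list what the lazy pattern actually matches. A minimal sketch using the sample string above (the variable name `matches` is only for illustration):

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# .*? is lazy, but it can still run past ']]' to reach the next '|'.
matches = re.findall(r'\[\[.*?\|', s)
print(matches)
# ['[[xxxx xxx|', '[[qrs]] tuv [[xx xxxx|']
```

The second match starts at '[[qrs]]' and swallows 'qrs' and 'tuv' on its way to the pipe in the third bracket pair.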

Any assistance would be most appreciated.

3
  • Can you please describe the discriminating factor, i.e., why those particular substrings? Commented Jun 30, 2015 at 21:35
  • Try using raw strings for your regular expressions, like r'\[\[.*?\|'. Commented Jun 30, 2015 at 21:36
  • @TigerhawkT3 in my text whenever a pipe occurs between a set of double brackets, everything before the pipe is a description I don't need in the final result. Commented Jun 30, 2015 at 22:05

4 Answers

1

What about this:

In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'

1 Comment

Should have specified that the x's between '[[' and the '|' represent a substring unknown to me. How can I alter your suggested regex to include any character?
1

I propose the opposite approach: instead of removing the parts you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.

There are two patterns you wish to catch:

  1. Case: words outside [[...]]

    Pattern: Any word that is either preceded by ']] ' or followed by ' [['.

    Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)

  2. Case: words inside [[...]]

    Pattern: Any word that is followed by ']]'.

    Regex: \w+(?=\]\])

Example code

#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789    "

# Three alternatives: a word after ']] ', a word before ' [[', or a word right before ']]'.
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))

Result:

['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
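
If you need a single string rather than a list, one way to finish is to join the matches. This is only a sketch, assuming a single space between words is acceptable in the output; `result` is an illustrative name.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789    "
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')

# Join the captured words back into one space-separated string.
result = ' '.join(p.findall(s))
print(result)
# abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Note that the pattern only picks up words directly adjacent to a bracket pair, which fits the sample string. Text with several plain words in a row would need a broader pattern.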

0
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'

This searches for and removes the following two items:

  1. Two opening brackets, followed by one or more characters that are not a closing bracket, followed by a pipe.
  2. Opening or closing brackets.
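
The same pattern can be spelled out with re.VERBOSE so that each piece carries its own comment. This is just a restatement of the regex above for readability, not a different technique; the names `pattern` and `cleaned` are illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

pattern = re.compile(r"""
      \[\[      # two opening brackets
      [^]]+?    # one or more characters that are not a closing bracket
      \|        # up to and including the pipe
    |           # OR
      [\[\]]    # any single leftover opening or closing bracket
""", re.VERBOSE)

cleaned = pattern.sub('', s)
print(cleaned)
# abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Because the first alternative cannot cross a closing bracket, a pipe-less pair such as [[qrs]] never matches it and only loses its brackets through the second alternative.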

1 Comment

just ran your solution through a sample of my file and it worked with every entry. Comes after a lot of frustration on the issue. Many thanks!
0

As a general solution with the built-in re module, you can use the following regex, which relies on look-arounds:

(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])

You can use re.finditer to get the desired result:

>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']

The preceding regex consists of two parts. The first is:

(?<!\[\[)\b([\w]+)\b(?!\|)

which matches any run of word characters that is not preceded by [[ and not followed by |.

The second part is:

(?<=\[\[)[^|]*(?=\]\])

which matches anything between double brackets that contains no |.
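
To see what each half of the alternation in the finditer call contributes, you can run the halves separately. A small sketch on the sample string; the variable names are illustrative, and the capture group is dropped since it does not change what is matched.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# Words not directly preceded by '[[' and not directly followed by '|'.
plain_words = re.findall(r'(?<!\[\[)\b\w+\b(?!\|)', s)

# Pipe-free text sitting directly between '[[' and ']]'.
bare_bracketed = re.findall(r'(?<=\[\[)[^|]*(?=\]\])', s)

print(plain_words)     # ['abcdef', 'ghijk', 'lmnop', 'tuv', 'wxyz', '0123456789']
print(bare_bracketed)  # ['qrs']
```

The first half picks up everything except the pipe-less bracket contents, and the second half fills in exactly that gap.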
