Let's say I have a string like this:

s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'

and I want to turn it into

'(xy09 and foobar or (abc123 and something))'

then - in this particular case - I could simply do

s.replace('X_', "")

which gives the desired output.

However, my actual data may contain other prefixes besides X_, so the replace call above does not work in general.

What I would need instead is a replacement of

a capital letter followed by an underscore and an arbitrary sequence of letters and numbers

by

everything after the first underscore.

So, to extract the desired elements I could use:

import re
print(re.findall('[A-Z]{1}_[a-zA-Z0-9]+', s))

which prints

['X_xy09', 'X_foobar', 'X_abc123', 'X_something']

How can I now replace those elements so that I obtain

'(xy09 and foobar or (abc123 and something))'

?

  • Tried using re.sub? Commented Dec 6, 2017 at 14:09
  • @WiktorStribiżew: Would love to, but don't know how to use it in this case. If you know how, feel free to post it as an answer... :) Commented Dec 6, 2017 at 14:10
  • Try ideone.com/Qs9ldO. What are the exact criteria for the pattern? Should it start with a word boundary or only after (? Commented Dec 6, 2017 at 14:11
  • @WiktorStribiżew: Ah, backreferencing! Great, that solves it. Please post it as an answer, then I upvote and accept. The pattern is as simple as described: capital letter, underscore, some arbitrary stuff. There are not always ( involved. Commented Dec 6, 2017 at 14:12
  • re.sub(r'[A-Z]_(?=[a-zA-Z0-9])', '', s) Commented Dec 6, 2017 at 14:13

4 Answers

If you need to remove an uppercase ASCII letter followed by an underscore, but only when it is not preceded by a word character and is followed by an alphanumeric character, you can use

import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s))

See the Python demo and a regex demo.

Pattern details

  • \b - a leading word boundary
  • [A-Z]_ - an ASCII uppercase letter and _
  • ([a-zA-Z0-9]) - Group 1 (referenced as \1 in the replacement pattern): one alphanumeric character.
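
As a minimal sketch of what the capture group is doing, the snippet below runs the pattern above next to a lookahead variant that removes the same prefix without capturing anything. The helper names strip_with_group and strip_with_lookahead and the second sample string are made up for illustration.

```python
import re

def strip_with_group(text):
    # Consume the prefix plus one alphanumeric char, then put that char back via \1.
    return re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', text)

def strip_with_lookahead(text):
    # Consume only the prefix; the lookahead checks the next char without consuming it.
    return re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', text)

samples = [
    '(X_xy09 and X_foobar or (X_abc123 and X_something))',
    'A_1 or B_two',  # hypothetical input with other prefixes
]

for text in samples:
    print(strip_with_group(text), '|', strip_with_lookahead(text))
# (xy09 and foobar or (abc123 and something)) | (xy09 and foobar or (abc123 and something))
# 1 or two | 1 or two
```

Both forms give the same result on these inputs; the lookahead form simply avoids the need for a backreference in the replacement.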

1 Comment

Note that this is equivalent to re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)

If you just need to replace a capital letter followed by an underscore, you can use the regular expression r'[A-Z]_'.

import re

s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'[A-Z]_', '', s))

You may need to add to it if you have other criteria not mentioned. (For example, some of your target values follow a word boundary and some follow parentheses.) The above might give you the wrong output if you have input like XY_something. It depends on what you expect the output to be.
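
To make the XY_something caveat concrete, here is a small sketch comparing the plain pattern with a stricter one that adds a word boundary and a lookahead. The input string is invented for illustration, and which output is "right" depends on your requirements.

```python
import re

text = 'XY_something and X_other'  # hypothetical input with a two-letter prefix

# Plain pattern: removes any capital letter + underscore, even mid-word.
loose = re.sub(r'[A-Z]_', '', text)

# Stricter pattern: the capital must start a word and be followed by an alphanumeric char.
strict = re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', text)

print(loose)   # Xsomething and other
print(strict)  # XY_something and other
```

The plain pattern strips Y_ out of XY_something and leaves a stray X behind, while the stricter one leaves the two-letter prefix untouched.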

1 Comment

Nice solution, too.

Another re.sub() approach:

import re

s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
result = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', s)

print(result)

The output:

(xy09 and foobar or (abc123 and something))

  • [A-Z]_(?=[a-zA-Z0-9]+) - (?=...) is a positive lookahead assertion; it ensures that the substituted [A-Z]_ substring is followed by an alphanumeric sequence [a-zA-Z0-9]+ without consuming that sequence.
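
A short sketch of what the lookahead changes in practice, using an invented input where one prefix is not followed by anything alphanumeric:

```python
import re

text = 'X_ and X_abc'  # hypothetical: the first X_ has no alphanumeric suffix

# With the lookahead, a bare "X_" is left alone because nothing alphanumeric follows it.
with_lookahead = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', text)

# Without it, every capital letter + underscore is removed.
without_lookahead = re.sub(r'[A-Z]_', '', text)

print(repr(with_lookahead))     # 'X_ and abc'
print(repr(without_lookahead))  # ' and abc'
```

Since a lookahead only needs to confirm that one alphanumeric character follows, (?=[a-zA-Z0-9]) without the + should behave the same way here.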

2 Comments

Works fine. What is the ?= part for?
@Cleb, that's a positive lookahead assertion; see my explanation

You could use re.sub() with a lookahead assertion:

>>> import re
>>> s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
>>> re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)
'(xy09 and foobar or (abc123 and something))'

From the docs:

(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
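
The example from the docs can be tried directly. This is only a small sketch; the variable names hit and miss are arbitrary.

```python
import re

pattern = re.compile(r'Isaac (?=Asimov)')

hit = pattern.match('Isaac Asimov')
miss = pattern.match('Isaac Newton')

# The match covers only 'Isaac ' (6 characters); 'Asimov' is checked but not consumed.
print(repr(hit.group()), hit.end())  # 'Isaac ' 6
print(miss)                          # None
```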

1 Comment

Seems to be the same as RomanPerekhrest's answer but still deserves an upvote... :)
