0

I have a list of strings in the following form:

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate', ' 0.04M bis-tris propane', ' pH 8.0 ']

I want to remove x.xxM but keep the number following pH. I tried the following:

import re
for i in range(len(d)):
    d[i] = d[i].translate(None,'[1-9]+\.*[0-9]*M')

which produced the following:

>>> d
['4 sodium propionate', ' 2 sodium cacodylate', ' 4 bistris propane', ' pH 8 ']

removing the .0 from the pH as well. I think translate() does not take order into account, right? Also, I don't understand why the 4, 2 etc. still remain in either of the elements. How could I remove the pieces of strings strictly in the form [1-9]+\.*[0-9]*M (meaning that there should be a digit, maybe followed by a . and zero or more digits, and an M)?

Edit: I now realize that translate() does not accept a regex. It treats the 0, ., and M as individual characters and removes each of them. I guess I can try re.search(), find the exact piece of string, and then do sub().
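
To make the behaviour described in the edit concrete, here is a minimal sketch of what translate() does with that argument. It assumes Python 3, where the Python 2 call `s.translate(None, chars)` is spelled with `str.maketrans('', '', chars)`; the names `pattern_text`, `delete_table`, and `translated` are only illustrative.

```python
# translate() treats its argument as a bag of single characters to delete,
# not as a regular expression.
pattern_text = r'[1-9]+\.*[0-9]*M'

# Python 3 equivalent of the Python 2 call s.translate(None, pattern_text).
delete_table = str.maketrans('', '', pattern_text)

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate',
     ' 0.04M bis-tris propane', ' pH 8.0 ']

translated = [s.translate(delete_table) for s in d]
print(translated)
# ['4 sodium propionate', ' 2 sodium cacodylate', ' 4 bistris propane', ' pH 8 ']
```

The only digits in the argument are 1, 9, and 0, so the 4 and 2 survive, while the literal `-` in `1-9` is why the hyphen in bis-tris disappears.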

7
  • Have you tried using the regex module (import re)? Commented May 5, 2015 at 21:14
  • Have you read the documentation for translate? Because it's totally unfit for the job. Commented May 5, 2015 at 21:15
  • I thought I was already using it. I'll add it to the question. Commented May 5, 2015 at 21:15
  • @KarolyHorvath I had, but I now realize that using regex is just plain wrong. What can I do instead? Commented May 5, 2015 at 21:17
  • the obvious thing, read the documentation for the regex module help(re) Commented May 5, 2015 at 21:25

5 Answers

3

I think your regex is almost correct, just that you should have used re.sub instead:

import re
for i in range(len(d)):
    d[i] = re.sub(r'[0-9]+\.[0-9]*M *', '', d[i])

So that d becomes:

['sodium propionate', ' sodium cacodylate', ' bis-tris propane', ' pH 8.0 ']

I did minimum modifications to your regex, but here is what each part means:

[0-9]+   # Match one or more digits (0 through 9)
\.       # Match a literal dot
[0-9]*   # Match zero or more digits (0 through 9)
M *      # Match the character 'M' and any spaces following it
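
As a possible variation on the pattern explained above, the sketch below makes the decimal part optional (the question says the digit is "maybe followed by a .") and adds word boundaries. The name `MOLARITY` and the sample string `'1M NaCl'` are assumptions for illustration, not part of the original answer.

```python
import re

# Digits, an optional ".digits" part, then M as a whole word,
# plus any whitespace that follows it.
MOLARITY = re.compile(r'\b[0-9]+(?:\.[0-9]+)?M\b\s*')

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate',
     ' 0.04M bis-tris propane', ' pH 8.0 ']

cleaned = [MOLARITY.sub('', s) for s in d]
print(cleaned)
# ['sodium propionate', ' sodium cacodylate', ' bis-tris propane', ' pH 8.0 ']

# A concentration without a decimal point is removed as well.
print(MOLARITY.sub('', '1M NaCl'))
# NaCl
```

Building a new list with a comprehension avoids indexing with range(len(d)); assign it back to d if the original name should be kept.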

Comments

1

Why would you use re.search and then re.sub? You just need re.sub. You also want to do two completely different things, so it makes sense to split them into two steps.

In [8]: d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate', ' 0.04M bis-tris propane', ' pH 8.0 ']

In [9]: d1 = [ re.sub(r"\d\.\d\dM", "",x) for x in d ]
In [10]: d1
Out[10]: [' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8.0 ']

In [11]: d2 = [ re.sub(r"pH (\d+)\.\d+",r"pH \1", x) for x in d1 ]

In [12]: d2
Out[12]: [' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8 ']

Note that I used \d, which is shorthand for any digit.

Comments

1

Consider re.sub:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

In your case:

>>> re.sub(r'\d\.\d(\d).', r'\1', '0.04M sodium propionate')
'4 sodium propionate'
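
To illustrate the signature quoted above, here is a small sketch of the `repl` and `count` arguments. The joined sample string `text` is hypothetical, and an empty replacement is used so the whole concentration is dropped.

```python
import re

# Hypothetical input with two concentrations in one string.
text = '0.04M sodium propionate, 0.02M sodium cacodylate'

# count=0 (the default) replaces every non-overlapping match.
all_removed = re.sub(r'\d\.\d\dM ', '', text)
print(all_removed)
# sodium propionate, sodium cacodylate

# count=1 replaces only the leftmost match.
first_only = re.sub(r'\d\.\d\dM ', '', text, count=1)
print(first_only)
# sodium propionate, 0.02M sodium cacodylate
```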

Comments

1

How about a quick and dirty

[re.sub(r'\b[.\d]+M\b', '', a).strip() for a in d]

which gives

['sodium propionate', 'sodium cacodylate', 'bis-tris propane', 'pH 8.0']

where [.\d]+ matches any continuous sequence of digits and dots, and M stands for molar. The two \b anchors ensure the match is a whole word, and strip() chops off the excess whitespace.
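
A quick sketch of what the \b anchors buy you, using a made-up input (`sample` is hypothetical) in which a digit is followed by an M that belongs to a longer word:

```python
import re

# Hypothetical input: "5MgCl2" contains "5M", but it is not a molarity.
sample = '5MgCl2 in 0.1M HEPES'

# With word boundaries, only the standalone "0.1M" is removed.
with_boundary = re.sub(r'\b[.\d]+M\b', '', sample)
print(with_boundary)
# 5MgCl2 in  HEPES

# Without them, the "5M" inside "5MgCl2" is removed too.
without_boundary = re.sub(r'[.\d]+M', '', sample)
print(without_boundary)
# gCl2 in  HEPES
```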

Comments

0

Here is a regex pattern to filter out x.xxM:

[\d|.]+M

It is intended to match a digit (\d) or (|) a dot (.), repeated one or more times (+), followed by M.

And here is the code:

result = [re.sub(r'[\d|.]+M',r'',i) for i in d]
# re.sub(A,B,Str) replaces all A with B in Str.

yielding this result:

[' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8.0 ']

17 Comments

Are you aware that this regex matches | as well?
@Jerry do we need to do [\d\|.]+M?
@sodiumnitrate This can be problematic even if it might not actually cause a problem when running the code. [\d|.]+ will match a number, | or .. If you ever get badly formatted input like 12|.32M the regex will match it. Or even .....M due to the way the regex is constructed.
@Jerry vertical bar '|' is an OR operator in regex. Unless it is escaped like '\|', it would not be matched literally in any manner. The regex is not perfect since I forgot to escape dot which virtually matches any character here including vertical bar as you mentioned.
@sodiumnitrate here is a new regex and sorry for the inconvenience: ((\d)|[.])+M
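
The disagreement in the comments above can be checked directly. In Python's re module, | and . are literal characters inside a character class. The sketch below runs the patterns from this thread against the badly formatted inputs mentioned above; the `strict` pattern and the variable names are assumptions added for comparison.

```python
import re

patterns = {
    'original': r'[\d|.]+M',         # pattern from the answer
    'without_bar': r'[\d.]+M',       # same class with the | dropped
    'revised': r'((\d)|[.])+M',      # pattern proposed in the last comment
    'strict': r'\d+(?:\.\d+)?M',     # assumption: digits, optional .digits, M
}

samples = ['0.04M', '12|.32M', '.....M']

# fullmatch() requires the whole sample to match the pattern.
results = {
    name: [re.fullmatch(rx, s) is not None for s in samples]
    for name, rx in patterns.items()
}

for name, flags in results.items():
    print(name, flags)
# original [True, True, True]
# without_bar [True, False, True]
# revised [True, False, True]
# strict [True, False, False]
```

Only the stricter form rejects a bare run of dots; whether that matters depends on how well formed the input is.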