Regex and python to remove a certain format of string

Question

I have a list of strings in the following form:

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate', ' 0.04M bis-tris propane', ' pH 8.0 ']

I want to remove x.xxM but keep the number following pH. I tried the following:

import re
for i in range(len(d)):
    d[i] = d[i].translate(None,'[1-9]+\.*[0-9]*M')

which produced the following:

>>> d
['4 sodium propionate', ' 2 sodium cacodylate', ' 4 bistris propane', ' pH 8 ']

This also removed the .0 from the pH. I think translate() does not take order into account, right? Also, I don't understand why the 4, 2, etc. still remain in the elements. How could I remove only the pieces of the strings that strictly match the form [1-9]+\.*[0-9]*M (meaning at least one digit, maybe followed by a . and zero or more digits, and then an M)?

Edit: I now realize that using regex doesn't work with translate(). It matches the 0, ., and M and removes them. I guess I can try re.search(), find the exact piece of string, and then do sub().
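
A minimal sketch of why the translate() attempt gave that output, assuming Python 3 (where the Python 2 call translate(None, chars) is written with str.maketrans('', '', chars)). translate() treats its argument as a set of individual characters to delete, not as a pattern, so only characters that literally appear in the pattern string are removed:

```python
# To translate(), the "pattern" is just a bag of characters to delete:
# [  1  -  9  ]  +  \  .  *  0  M
chars_to_delete = r'[1-9]+\.*[0-9]*M'
table = str.maketrans('', '', chars_to_delete)

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate',
     ' 0.04M bis-tris propane', ' pH 8.0 ']

# Each character is dropped independently of its neighbours.
result = [s.translate(table) for s in d]
print(result)
```

The digits 4, 2 and 8 survive because they never appear in the pattern string, while 0, ., M and the hyphen do, which is also why bis-tris became bistris.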

Have you read the documentation for translate? because it's totally unfit for the job
– Karoly Horvath, Commented May 5, 2015 at 21:15
I thought I was already using it. I'll add it to the question.
– sodiumnitrate, Commented May 5, 2015 at 21:15
@KarolyHorvath I had, but I now realize that using regex is just plain wrong. What can I do instead?
– sodiumnitrate, Commented May 5, 2015 at 21:17
the obvious thing, read the documentation for the regex module help(re)
– Karoly Horvath, Commented May 5, 2015 at 21:25

Jerry · Accepted Answer · 2015-05-05 21:41:24Z

3

I think your regex is almost correct; you just should have used re.sub instead:

import re
for i in range(len(d)):
    d[i] = re.sub(r'[0-9]+\.[0-9]*M *', '', d[i])

So that d becomes:

['sodium propionate', ' sodium cacodylate', ' bis-tris propane', ' pH 8.0 ']

I made minimal modifications to your regex, but here is what each part means:

[0-9]+   # Match at least one digit (0 through 9 inclusive)
\.       # Match a literal dot
[0-9]*   # Match zero or more digits (0 through 9 inclusive)
M *      # Match the character 'M' and any spaces following it
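
The same breakdown can be kept next to the pattern itself. This is only a sketch using re.VERBOSE; the name MOLARITY is made up for illustration, and because verbose mode ignores plain whitespace in the pattern, the trailing space is written as [ ]:

```python
import re

# MOLARITY is a hypothetical name; the pattern is the one from this answer.
MOLARITY = re.compile(r"""
    [0-9]+   # at least one digit
    \.       # a literal dot
    [0-9]*   # zero or more digits
    M[ ]*    # 'M' and any spaces following it
""", re.VERBOSE)

d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate',
     ' 0.04M bis-tris propane', ' pH 8.0 ']

# A list comprehension avoids mutating d in place.
cleaned = [MOLARITY.sub('', s) for s in d]
print(cleaned)
```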

cge · Accepted Answer · 2015-05-05 21:33:11Z

1

Why would you use re.search and then re.sub? You just need re.sub. You also want to do two completely different things, so it makes sense to split them into two steps.

In [8]: d = ['0.04M sodium propionate', ' 0.02M sodium cacodylate', ' 0.04M bis-tris propane', ' pH 8.0 ']

In [9]: d1 = [ re.sub(r"\d\.\d\dM", "",x) for x in d ]
In [10]: d1
Out[10]: [' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8.0 ']

In [11]: d2 = [ re.sub(r"pH (\d+)\.\d+",r"pH \1", x) for x in d1 ]

In [12]: d2
Out[12]: [' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8 ']

Note that I used \d, which is shorthand for any digit.
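
One thing to keep in mind is that \d\.\d\dM only matches exactly one digit, a dot, and two digits. A small sketch comparing it with a more flexible pattern follows; the names fixed and flexible and the last two sample strings are invented for illustration and are not from the question:

```python
import re

fixed = re.compile(r"\d\.\d\dM")      # exactly x.xxM
flexible = re.compile(r"\d+\.\d*M")   # one or more digits, a dot, any digits, M

# Only the first sample comes from the question; the others are hypothetical.
samples = ['0.04M sodium propionate', '0.5M sodium chloride', '12.5M urea']

fixed_result = [fixed.sub('', s) for s in samples]
flexible_result = [flexible.sub('', s) for s in samples]

print(fixed_result)     # the last two samples are left untouched
print(flexible_result)  # the concentration is removed from all three
```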

Adam Matan · Accepted Answer · 2015-05-05 21:34:42Z

1

Consider re.sub:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

In your case:

>>> re.sub(r'\d\.\d(\d).', r'\1', '0.04M sodium propionate')
'4 sodium propionate'
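
As a small illustration of the count parameter in the signature quoted above, here is a sketch that removes either every concentration or only the first one. The combined input string is made up for the example, and the pattern is borrowed from another answer on this page:

```python
import re

pattern = r'[0-9]+\.[0-9]*M *'
# Hypothetical input: two entries joined into a single string.
text = '0.04M sodium propionate, 0.02M sodium cacodylate'

# count=0 (the default) replaces every non-overlapping match.
all_removed = re.sub(pattern, '', text)
# count=1 replaces only the leftmost match.
first_only = re.sub(pattern, '', text, count=1)

print(all_removed)
print(first_only)
```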

user2963623 · Accepted Answer · 2015-05-05 22:06:45Z

1

How about a quick and dirty

[re.sub(r'\b[.\d]+M\b', '', a).strip() for a in d]

which gives

['sodium propionate', 'sodium cacodylate', 'bis-tris propane', 'pH 8.0']

where [.\d]+ matches any continuous sequence of digits and dots, and M stands for molar. The two \b anchors ensure the match sits on word boundaries, and strip() chops off the excess whitespace!
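
To see what the \b anchors buy you, here is a rough sketch with an invented input, '5Mg tablets', where the M is part of a longer token. The variable names are hypothetical:

```python
import re

bounded = re.compile(r'\b[.\d]+M\b')   # pattern from this answer
unbounded = re.compile(r'[.\d]+M')     # same pattern without the anchors

sample = '5Mg tablets'  # made-up input, not from the question

# 'M' is followed by 'g', so there is no word boundary and nothing matches.
kept = bounded.sub('', sample)
# Without \b the '5M' prefix is stripped out of the middle of the token.
mangled = unbounded.sub('', sample)

print(kept)
print(mangled)
```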

edited May 5, 2015 at 22:06

answered May 5, 2015 at 22:00

Hua2308 · Accepted Answer · 2015-05-05 21:49:34Z

0

Here is a regex pattern to filter out x.xxM:

[\d|.]+M

It means a string of digits (\d) or (|) dots (.) appearing one or more times (+), ending with M (M).

And here is the code:

result = [re.sub(r'[\d|.]+M',r'',i) for i in d]
# re.sub(A,B,Str) replaces all A with B in Str.

yielding this result:

[' sodium propionate', '  sodium cacodylate', '  bis-tris propane', ' pH 8.0 ']

17 Comments

Jerry Over a year ago

Are you aware that this regex matches | as well?

sodiumnitrate Over a year ago

@Jerry do we need to do [\d\|.]+M?

Jerry Over a year ago

@sodiumnitrate This can be problematic even if it might not actually cause a problem when running the code. [\d|.]+ will match a number, | or .. If you ever get badly formatted input like 12|.32M the regex will match it. Or even .....M due to the way the regex is constructed.

Hua2308 Over a year ago

@Jerry vertical bar '|' is an OR operator in regex. Unless it is escaped like '\ |', it would not be matched literally in any manner. The regex is not perfect since I forgot to escape dot which virtually matches any character here including vertical bar as you mentioned.

Hua2308 Over a year ago

@sodiumnitrate here is a new regex and sorry for the inconvenience: ((\d)|[.])+M
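
The point raised in the comments above can be checked directly. Inside a character class, | and . are literal characters rather than alternation or a wildcard, so [\d|.]+M accepts the badly formatted inputs mentioned. The stricter pattern below is only one possible tightening, offered as an assumption and not taken from the thread:

```python
import re

loose = re.compile(r'[\d|.]+M')           # pattern from the answer
strict = re.compile(r'\d+(?:\.\d+)?M')    # hypothetical stricter alternative

# Inside [...], '|' and '.' are matched literally.
loose_accepts_bar = loose.fullmatch('12|.32M') is not None
loose_accepts_dots = loose.fullmatch('.....M') is not None

# The stricter form wants digits, optionally a dot plus digits, then M.
strict_accepts_good = strict.fullmatch('0.04M') is not None
strict_accepts_bar = strict.fullmatch('12|.32M') is not None
strict_accepts_dots = strict.fullmatch('.....M') is not None

print(loose_accepts_bar, loose_accepts_dots)
print(strict_accepts_good, strict_accepts_bar, strict_accepts_dots)
```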
