1

I have a large list of chemical data that contains entries like the following:

1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc

I have a function that correctly splits the first entry into: ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']

based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, which have numbers in their names separated by commas (e.g. 2,4,5-TP).

Is there an easy, Pythonic way to do this?

3 Answers

2

Here is a short explanation, based on @eph's answer:

import re

data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))

re.split(pattern, string) splits string at each occurrence of the regex pattern. (Please read Regex Quick Start if you are not familiar with regex.)

The pattern (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:

  • The middle | is the OR operator.
  • \D matches a single character that is not a digit.
  • \s matches a whitespace character (includes tabs and line breaks).
  • , matches character ",".
  • * attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
  • (?<= ... ) and (?= ... ) are the lookbehind and lookahead assertions. For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
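To make the lookaround bullets concrete, here is a small sketch using only the standard re module. The sample strings ("quit", "Iraq", "a,1,b") are made up for illustration and are not from the question.

```python
import re

# Lookahead: q(?=u) matches a "q" only when a "u" follows,
# and the "u" itself is not part of the match.
lookahead_hit = re.search(r"q(?=u)", "quit")
lookahead_miss = re.search(r"q(?=u)", "Iraq")
print(lookahead_hit.group())   # q
print(lookahead_miss)          # None

# Lookbehind: (?<=\D), matches a comma only when the character
# just before it is not a digit.
commas = re.findall(r"(?<=\D),", "a,1,b")
print(commas)                  # [',']  (only the comma after "a")
```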

Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which covers your requirement: ',' with only two non-numeric characters on either side. A comma with digits on both sides, as in 2,4,5-TP, satisfies neither case and is left alone.
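If you want to reuse this, one possible way to wrap it is sketched below. The names SEPARATOR and split_chemicals are my own (hypothetical), and the strip() call is an extra I added so that a stray space is removed when the second alternative matches a comma that follows a digit.

```python
import re

# A comma counts as a separator if a non-digit sits directly before it
# (first alternative) or directly after it (second alternative).
SEPARATOR = re.compile(r"(?<=\D),\s*|\s*,(?=\D)")

def split_chemicals(entry):
    """Split one entry into chemical names, keeping names like '2,4,5-TP' whole."""
    # strip() is an assumption of this sketch: it tidies leftover whitespace.
    return [part.strip() for part in SEPARATOR.split(entry)]

first = split_chemicals("2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP")
second = split_chemicals("Lead,Paints/Pigments,Zinc")
print(first)   # ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
print(second)  # ['Lead', 'Paints/Pigments', 'Zinc']
```

Compiling the pattern once is just a convenience when the function is called for every entry in a large list.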


1

Use a regex with lookbehind/lookahead assertions:

>>> import re
>>> s = '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP'
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
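Since the snippet above only shows one input, here is a quick check of the same pattern on both entries from the question. The third string is a hypothetical entry I made up to show a case this kind of rule cannot handle: a separator comma with digits on both sides.

```python
import re

# Same pattern as above: split on a comma that has two non-digits
# directly before it or two non-digits directly after it.
pattern = re.compile(r"(?<=\D\D),\s*|,\s*(?=\D\D)")

s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
s2 = "Lead,Paints/Pigments,Zinc"
print(pattern.split(s1))  # ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
print(pattern.split(s2))  # ['Lead', 'Paints/Pigments', 'Zinc']

# Hypothetical entry (not from the question): the comma after "B12"
# has a digit on each side, so it is not treated as a separator.
s3 = "Vitamin B12,2,4-D"
print(pattern.split(s3))  # ['Vitamin B12,2,4-D']
```

Whether a case like the third one matters depends on the real data, so it is worth scanning the list for names that end in a digit before relying on this.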

4 Comments

I have no idea what is going on in those parentheses. Care to elucidate? Otherwise, I have no idea why I would be implementing what you propose, even if it worked. And you did not show how it would split both inputs correctly, so I'm not sure it even does....
@traggatmot Regular expressions are a standard and efficient way to deal with rule-based string transformations. If you don't have a basic idea of how they work, I suggest reading the manual for the Python re module or a tutorial.
I'll read the manual and follow your suggestion when it becomes apparent it's a solid solution. The answer by Mayur at least demonstrates regular expression tools on both strings. But the fact that your answers look completely different while achieving the same results doesn't generate a lot of motivation to spend a lot of time learning regular expressions.
@traggatmot Mayur's solution is based on finding the words to extract, while mine is based on splitting on the separator ',' with only two non-numeric characters on either side. I didn't show the second string because the result is relatively obvious and you can validate it yourself. But you are right about one thing: neither solution is guaranteed to always be correct. That depends on more details of the data, and you should think about it more carefully.
0
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
