
I have ingredient lists for thousands of products, for example:

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

I want to turn this string into a list like the following:

listOfIngredients = ['Beef Stock', 'low lactose cream', 'onion', 'mustard', 'modified maize starch','tomato puree', 'modified potato starch', 'butter sugar', 'salt', 'burnt sugar', 'blackcurrant', 'peppercorns']

So listOfIngredients should contain neither the percentages nor the sub-ingredients that an ingredient itself contains. Regex seems like a good way to do this, but I am not good at writing regexes. Can someone help me write a regex that produces the desired output? Thanks in advance.

  • How do you know onion, mustard should be onion mustard (comma is missing in the expected results)? Commented Oct 13, 2016 at 10:27
  • You can't just ask people to write code for you. What you have tried so far? Commented Oct 13, 2016 at 10:40
  • @WiktorStribiżew sorry. edited. 'onion', 'mustard' Commented Oct 13, 2016 at 10:48
  • Good, so, you had a typo, but did you at least think of an approach to get the strings you need? You may be not good at regex, but you must have thought of the specs, right? Commented Oct 13, 2016 at 10:48
  • @WiktorStribiżew I have not worked with regex much, so I tried some shitty things like strip, rstrip and slicing, but they are giving shitty results. That is why I want to get started with regex Commented Oct 13, 2016 at 10:53

1 Answer


You might try two approaches.

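The first one is to remove every (...) substring together with any text that follows it up to the next separating comma. A comma that is immediately followed by a word character (as in 0,4%) is not treated as a separator, so it is removed along with the text after it.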

\s*\([^()]*\)[^,]*(?:,\b[^,]*)*

See the regex demo

Details:

  • \s* - 0+ whitespaces
  • \([^()]*\) - a (...) substring having no ( and ) inside:
    • \( - a literal (
    • [^()]* - 0+ chars other than ( and ) (a [^...] is a negated character class)
    • \) - a literal )
  • [^,]* - 0+ chars other than ,
  • (?:,\b[^,]*)* - zero or more sequences of:
    • ,\b - a comma that is followed with a letter/digit/underscore
    • [^,]* - 0+ chars other than ,.

These matches are removed, and then the ,\s* regex is used to split the string at each comma followed by 0+ whitespaces to get the final result.
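The remove-then-split steps can be sketched as follows. This is only an illustration: the verbose layout of the pattern, the names PAREN_AND_TAIL and strip_and_split, and the shortened sample string are assumptions made for this example, not part of the original answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can carry a comment.
PAREN_AND_TAIL = re.compile(r"""
    \s*                 # optional whitespace before the parenthesis
    \( [^()]* \)        # a (...) group with no nested parentheses
    [^,]*               # anything up to the next comma, e.g. ' 0'
    (?: ,\b [^,]* )*    # a comma glued to a word char (as in '0,4%') plus what follows
""", re.VERBOSE)


def strip_and_split(text):
    """Remove the (...) parts and their tails, then split on commas."""
    cleaned = PAREN_AND_TAIL.sub("", text)
    return re.split(r",\s*", cleaned)


sample = "salt (0,8%), burnt sugar, peppercorns (black, pink) 0,4%."
cleaned = PAREN_AND_TAIL.sub("", sample)
print(cleaned)                  # salt, burnt sugar, peppercorns
print(strip_and_split(sample))  # ['salt', 'burnt sugar', 'peppercorns']
```

Printing the intermediate string makes it easy to see what the removal step leaves behind before the split is applied.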

The second one matches (...) substrings without capturing them, and captures runs of words that consist of letters (and _) only. Only the captured groups are kept.

\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)

See the second regex demo

Details:

  • \([^()]*\) - a (...) substring having no ( and ) inside
  • | - or
  • ([^\W\d]+(?:\s+[^\W\d]+)*) - Group 1 capturing:
    • [^\W\d]+ - 1+ letters or underscores (you may add _ after \d to exclude underscores)
    • (?:\s+[^\W\d]+)* - 0+ sequences of:
      • \s+ - 1 or more whitespaces
      • [^\W\d]+ - 1+ letters or underscores
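A small sketch of how this pattern behaves with re.findall may help here. The shortened sample strings and the variable names (pattern, strict, raw, names) are made up for this illustration. Because the pattern has a single capturing group, findall returns Group 1 for every match, which is an empty string whenever the (...) alternative matched.

```python
import re

pattern = re.compile(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)')

sample = 'Beef stock (beef bones, water), salt (0,8%), burnt sugar'

# One entry per match; the (...) matches show up as empty strings.
raw = pattern.findall(sample)
print(raw)    # ['Beef stock', '', 'salt', '', 'burnt sugar']

# Keep only the non-empty captures.
names = [x for x in raw if x]
print(names)  # ['Beef stock', 'salt', 'burnt sugar']

# Variant with _ added after \d, so underscores no longer count as word chars.
strict = re.compile(r'\([^()]*\)|([^\W\d_]+(?:\s+[^\W\d_]+)*)')
print([x for x in pattern.findall('sea_salt, sugar') if x])  # ['sea_salt', 'sugar']
print([x for x in strict.findall('sea_salt, sugar') if x])   # ['sea', 'salt', 'sugar']
```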

Both return the same results for the current string, but you may need to adjust them for other inputs.
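To see where the two approaches can diverge, here is a hedged sketch on an invented input that contains a hyphen and a digit. The sample string and the helper names remove_then_split and match_words are assumptions for illustration only.

```python
import re


def remove_then_split(text):
    """First approach: drop (...) parts and their tails, then split on commas."""
    cleaned = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', '', text)
    return re.split(r',\s*', cleaned)


def match_words(text):
    """Second approach: capture runs of letter-only words, skipping (...) parts."""
    found = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', text)
    return [x for x in found if x]


sample = 'low-fat milk (3%), vitamin B12, salt'
print(remove_then_split(sample))  # ['low-fat milk', 'vitamin B12', 'salt']
print(match_words(sample))        # ['low', 'fat milk', 'vitamin B', 'salt']
```

The first approach keeps whatever sits between the commas, so hyphens and digits survive. The second only accepts letters and whitespace, so it breaks at the hyphen and drops the digits. Which behaviour is preferable depends on the data.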

See Python demo:

import re

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

# Approach 1: remove the (...) parts and any trailing text such as " 0,4%.", then split on commas
res = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', "", Ingredient)
print(re.split(r',\s*', res))

# Approach 2: findall returns Group 1 only, which is an empty string for the (...) matches
vals = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', Ingredient)
vals = [x for x in vals if x]  # drop the empty strings
print(vals)

1 Comment

Thanks a lot. It was quite helpful :)
