Identify strings while removing substrings in python

Question

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}

I have a set of strings (removed punctuation marks) as follows.

recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

From the above string, I need to output only "biscuit pudding", "yummy tim tam" and "milk" by referring to the dictionary, but NOT sugar, because it appears only as part of rawsugar in the string.

However, the code I am currently using outputs sugar as well.

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
searcher = re.compile(r'{}'.format("|".join(mydictionary.keys())), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

How can I avoid matching substrings like that and only consider full tokens such as 'milk'? Please help me.

Why did you accept an answer if it does not work for you? Update the question since it is the same issue you described here. Word boundaries are only a part of the solution here.
– Wiktor Stribiżew, Commented Oct 3, 2017 at 12:15

akash karothiya · Accepted Answer · 2017-10-03 10:26:32Z

1

Use the word boundary '\b', which matches only at the start or end of a word. Put simply:

recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

>>> re.findall(r'(?is)(\bchocolates\b|\bbiscuit pudding\b|\bsugar\b|\byummy tim tam\b|\bmilk\b)',recipes_book)
['biscuit pudding', 'yummy tim tam', 'milk']
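To make the effect of '\b' concrete, here is a minimal sketch that runs the same search for 'sugar' with and without word boundaries. The shortened text is only an illustrative excerpt of the question's string.

```python
import re

text = "biscuit pudding using yummy tim tam milk and rawsugar"

# Without boundaries, 'sugar' is found inside 'rawsugar'.
unbounded = re.findall(r'sugar', text)

# With \b on both sides, the match must start and end at a word edge.
# There is no word edge between 'w' and 's', so 'rawsugar' is rejected.
bounded = re.findall(r'\bsugar\b', text)

print(unbounded)  # ['sugar']
print(bounded)    # []
```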

answered Oct 3, 2017 at 10:26

akash karothiya


2 Comments

user8566323 Over a year ago

Without hard coding my dictionary keys in re.findall is there any easy way of doing it?

akash karothiya Over a year ago

I am just illustrating the use of \b word boundary here, you can edit accordingly, just check @Delimitry answer :)

Delimitry · Accepted Answer · 2017-10-03 10:36:30Z

0

You can update your code to wrap each key in regex word boundaries:

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
searcher = re.compile(r'{}'.format("|".join(map(lambda x: r'\b{}\b'.format(x), mydictionary.keys()))), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

Output:

biscuit pudding
yummy tim tam
milk
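The same idea can be wrapped in a small helper so the keys are never hard-coded. This is only a sketch: `find_terms` is a hypothetical name, and escaping each key with re.escape and sorting longer keys first are assumptions added here as precautions, not part of the answer above.

```python
import re

def find_terms(terms, text):
    """Return every whole-word occurrence of any term, in text order.

    Hypothetical helper for illustration only.
    """
    if not terms:
        return []
    # Try longer terms first so a long phrase wins over a shorter one
    # that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    # Escape each term, then wrap the whole alternation in one \b ... \b pair.
    pattern = r'\b(?:{})\b'.format('|'.join(re.escape(t) for t in ordered))
    return re.findall(pattern, text, flags=re.I)

mydictionary = {'yummy tim tam': 3, 'milk': 2, 'chocolates': 5,
                'biscuit pudding': 3, 'sugar': 2}
recipes_book = ("For today's lesson we will show you how to make biscuit "
                "pudding using yummy tim tam milk and rawsugar")

print(find_terms(mydictionary, recipes_book))
# ['biscuit pudding', 'yummy tim tam', 'milk']
```

Iterating over a dictionary yields its keys, so the dictionary can be passed in directly.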

edited Oct 3, 2017 at 10:36

answered Oct 3, 2017 at 10:30

Delimitry


1 Comment

Wiktor Stribiżew Over a year ago

You may also remove re.S as it does not make any difference.
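The remark about re.S can be checked directly: the flag only changes what '.' matches, so a pattern without '.' gives the same result either way. A minimal sketch with made-up sample strings:

```python
import re

text = "tim\ntam"

# By default '.' does not match a newline; with re.S it does.
without_flag = re.findall(r'tim.tam', text)
with_flag = re.findall(r'tim.tam', text, flags=re.S)
print(without_flag)  # []
print(with_flag)     # ['tim\ntam']

# A pattern containing no '.' is unaffected by re.S.
pattern = r'\bmilk\b'
sample = "tim tam milk\nand rawsugar"
same = re.findall(pattern, sample) == re.findall(pattern, sample, flags=re.S)
print(same)  # True
```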

Dinesh Pundkar · Accepted Answer · 2017-10-03 10:40:37Z

0

One more way, using re.escape, which escapes any regex metacharacters in the dictionary keys so that they are matched literally.

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

val_list = []

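for i in mydictionary.keys():
    # Escape regex metacharacters in the key, then require word boundaries on both sides
    regex_tmp = r'\b' + re.escape(str(i)) + r'\b'
    tmp_list = re.findall(regex_tmp, recipes_book)
    val_list.extend(tmp_list)

# Results appear in dictionary iteration order, not in text order
print(val_list)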

Output:

"C:\Program Files (x86)\Python27\python.exe" C:/Users/punddin/PycharmProjects/demo/demo.py
['yummy tim tam', 'biscuit pudding', 'milk']
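Why re.escape matters shows up once a key contains a regex metacharacter. The sketch below uses a made-up key, "1.5 cups", and a made-up sentence. Neither comes from the question.

```python
import re

text = "add 105 cups of milk"
key = "1.5 cups"  # hypothetical key containing '.', a regex metacharacter

# Unescaped, '.' matches any character, so '105 cups' is wrongly accepted.
unescaped = re.findall(r'\b' + key + r'\b', text)

# Escaped, '.' must be a literal dot, so nothing matches.
escaped = re.findall(r'\b' + re.escape(key) + r'\b', text)

print(unescaped)  # ['105 cups']
print(escaped)    # []
```

Note that '\b' only behaves as a whole-word test when the key itself starts and ends with a word character. Keys that begin or end with punctuation would need a different boundary check.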

answered Oct 3, 2017 at 10:40

Dinesh Pundkar
