
I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}

I have a string (with punctuation marks removed) as follows.

recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

From the above string I need to output only "biscuit pudding", "yummy tim tam" and "milk" by looking them up in the dictionary. It should NOT output "sugar", because the string only contains "rawsugar".

However, the code I am currently using outputs sugar as well.

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
searcher = re.compile(r'{}'.format("|".join(mydictionary.keys())), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

How can I avoid matching substrings like that and only match full tokens such as 'milk'? Please help me.

  • Use the word boundary \b. Commented Oct 3, 2017 at 10:21
  • Why did you accept an answer if it does not work for you? Update the question since it is the same issue you described here. Word boundaries are only a part of the solution here. Commented Oct 3, 2017 at 12:15

3 Answers


Use the word boundary '\b'. Put simply:

>>> import re
>>> recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
>>> re.findall(r'(?is)(\bchocolates\b|\bbiscuit pudding\b|\bsugar\b|\byummy tim tam\b|\bmilk\b)', recipes_book)
['biscuit pudding', 'yummy tim tam', 'milk']

2 Comments

Is there an easy way of doing this without hard-coding my dictionary keys in re.findall?
I am just illustrating the use of the \b word boundary here; you can adapt it accordingly. See @Delimitry's answer :)

You can update your code to wrap each key in regex word boundaries:

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
# Wrap every key in \b...\b so it only matches as a whole word or phrase
pattern = "|".join(r'\b{}\b'.format(key) for key in mydictionary)
searcher = re.compile(pattern, flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

Output:

biscuit pudding
yummy tim tam
milk

1 Comment

You may also remove re.S as it does not make any difference.
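
The point about re.S can be checked with a short sketch. re.S (re.DOTALL) only changes what "." matches, and the pattern built from these keys contains no ".", so both compiled patterns should return the same matches. The variable names with_dotall and without_dotall are made up for this illustration.

```python
import re

mydictionary = {'yummy tim tam': 3, 'milk': 2, 'chocolates': 5, 'biscuit pudding': 3, 'sugar': 2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

# Same word-boundary pattern as in the answer above
pattern = "|".join(r"\b{}\b".format(key) for key in mydictionary)

# Hypothetical names, used only to compare the two flag combinations
with_dotall = re.compile(pattern, flags=re.I | re.S).findall(recipes_book)
without_dotall = re.compile(pattern, flags=re.I).findall(recipes_book)

print(with_dotall == without_dotall)
```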

Another way is to build one pattern per key with re.escape, which escapes any regex special characters in the key so that it is matched literally.

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

val_list = []

for key in mydictionary:
    # Escape the key so it is matched literally, then require a word boundary on both sides
    regex_tmp = r'\b' + re.escape(key) + r'\b'
    val_list.extend(re.findall(regex_tmp, recipes_book))

print(val_list)

Output (run with Python 2.7; the order of the matches follows the dictionary's iteration order, so it may differ on other versions):

['yummy tim tam', 'biscuit pudding', 'milk']
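
The per-key loop above and the single alternation from the earlier answer can be combined. The following is only a sketch under a few assumptions: build_searcher is a hypothetical helper, and 'milk chocolate' is an extra key invented to show a case where one key is a prefix of another. Sorting the keys longest first means the longer phrase is tried before the shorter one, since regex alternation takes the first alternative that matches at a given position.

```python
import re

def build_searcher(terms):
    """Compile one case-insensitive pattern matching any term as a whole word or phrase.

    Hypothetical helper for illustration; not part of the original answers.
    """
    # Longest terms first, so 'milk chocolate' is tried before 'milk'
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape makes every term match literally
    alternation = "|".join(re.escape(term) for term in ordered)
    # A single pair of \b around a non-capturing group covers every alternative
    return re.compile(r"\b(?:{})\b".format(alternation), flags=re.I)

# 'milk chocolate' is an invented key, added only to show the overlap case
mydictionary = {'yummy tim tam': 3, 'milk': 2, 'milk chocolate': 4, 'sugar': 2}
text = "We use milk chocolate and milk but only rawsugar"

matches = build_searcher(mydictionary).findall(text)
print(matches)  # ['milk chocolate', 'milk']
```

Because the text is scanned once, the matches come back in the order they appear in the text rather than in dictionary order.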
