I want to use the sitemap at https://www.bbc.co.uk/food/sitemap.xml to get a list of recipe URLs in Python. I tried xmltodict, but as far as I can see, it does not convert the document into a form I can easily use. My code:

import urllib.request
import xmltodict

with urllib.request.urlopen('https://www.bbc.co.uk/food/sitemap.xml') as url:
    data = url.read()

data = xmltodict.parse(data)
print(data)

And part of the result:

[...] OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurtspicedchicken_74830'), ('lastmod', '2012-06-07'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yoghurtspicedchicken_74830_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurt_and_muesli_61842'), ('lastmod', '2018-04-18')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurt_cake_87253'), ('lastmod', '2020-03-31'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yoghurt_cake_87253_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirecurdpie_86473'), ('lastmod', '2019-05-23')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshireparkin_83745'), ('lastmod', '2019-01-02')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepotwithchri_87677'), ('lastmod', '2018-12-03')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingswit_92145'), ('lastmod', '2016-09-13')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddings_86010'), ('lastmod', '2018-08-08'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepuddings_86010_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingviap_9974'), ('lastmod', '2015-12-07')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingwith_83703'), ('lastmod', '2018-10-30')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepudding_81824'), ('lastmod', '2019-01-21'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepudding_81824_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepudding_93848'), ('lastmod', '2018-08-08'), ('image:image', OrderedDict([('image:loc', 
'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepudding_93848_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_20002'), ('lastmod', '2019-01-03'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_curd_tart_20002_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_23874'), ('lastmod', '2019-01-03')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_63644'), ('lastmod', '2016-09-19')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_oatmeal_parkin_13911'), ('lastmod', '2016-09-19')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_61798'), ('lastmod', '2018-11-28'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_puddings_61798_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_and_40867'), ('lastmod', '2018-12-04')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_with_15870'), ('lastmod', '2018-04-30')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_with_50889'), ('lastmod', '2019-02-11')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_pudding_69240'), ('lastmod', '2019-12-10'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_pudding_69240_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_pudding_wraps_73052'), ('lastmod', '2019-09-30'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_pudding_wraps_73052_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_tapas_puddings_93245'), ('lastmod', '2016-09-14'), ('image:image', OrderedDict([('image:loc', 
'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_tapas_puddings_93245_16x9.jpg')]))]), [...]

I want to get only the URLs contained in the <loc> tags of the XML, and filter them by the pattern "https://www.bbc.co.uk/food/recipes/".

  • In this case, it could be easy to extract the URLs using a regex. Commented Jul 17, 2020 at 22:26
  • How much of the document do you want, just the url/loc URLs? Commented Jul 17, 2020 at 22:32
  • Yes, I want to make a list of the URLs that are in url/loc and that also match the pattern. Commented Jul 17, 2020 at 22:34

1 Answer

Instead of the convenience library xmltodict, which works best on simpler, flatter XML documents, consider parsing the XML with Python's built-in xml.etree.ElementTree module and building the dictionaries yourself.

Be sure to map the namespace prefixes, and retrieve the image conditionally, since <image:image> is not always present under a <url> node.

import urllib.request
import xml.etree.ElementTree as et

with urllib.request.urlopen('https://www.bbc.co.uk/food/sitemap.xml') as url:
    data = url.read()

# Parse the downloaded bytes into an element tree
root = et.fromstring(data)

# Prefix-to-URI map for the two namespaces used in the sitemap
nsmp = {"doc": "http://www.sitemaps.org/schemas/sitemap/0.9",
        "image": "http://www.google.com/schemas/sitemap-image/1.1"}

recipes_dict = []

for url in root.findall('doc:url', namespaces=nsmp):
    loc = url.find('doc:loc', namespaces=nsmp).text

    # <image:image> is optional, so fall back to None when it is missing
    img_node = url.find('image:image', namespaces=nsmp)
    img = img_node.find('image:loc', namespaces=nsmp).text if img_node is not None else None

    recipes_dict.append({'loc': loc, 'img': img})

Output

len(recipes_dict)
# 20084

recipes_dict[1:20]    
# {'loc': 'https://www.bbc.co.uk/food/', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/recipes', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/chefs', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/programmes', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/ingredients', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/seasons', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/occasions', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/cuisines', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/techniques', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/recipes/10minutepizza_87314', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/recipes/15_minute_pasta_33407', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/15_minute_pasta_33407_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/1_creamy_chicken_pasta_24218', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/1_creamy_chicken_pasta_24218_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/1_hoisin_spinach_and_egg_86057', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/1_hoisin_spinach_and_egg_86057_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/1_mixed_vegetable_and_84703', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/1_mixed_vegetable_and_84703_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/2_hour_christmas_dinner_79341', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/2_hour_christmas_dinner_79341_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/3d_biscuits_29555', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/3d_biscuits_29555_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/3wayswithlemoncurd_67266', 'img': None}
# {'loc': 'https://www.bbc.co.uk/food/recipes/3_stir-fry_sauces_52376', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/3_stir-fry_sauces_52376_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/5-ingredient_33925', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/5-ingredient_33925_16x9.jpg'}
# {'loc': 'https://www.bbc.co.uk/food/recipes/5-minute_chicken_noodle_78996', 'img': 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/5-minute_chicken_noodle_78996_16x9.jpg'}
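
As the output shows, the sitemap also lists pages that are not recipes, so the pattern filter from the question still has to be applied. A minimal sketch of one way to do that, assuming a plain prefix match is enough: the helper name recipe_urls, the constant names, and the small hand-written sample document are my own illustrations and stand in for the downloaded data.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard sitemap elements (<urlset>, <url>, <loc>)
SITEMAP_NS = {"doc": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Pattern from the question; the trailing slash excludes the /recipes index page
RECIPE_PREFIX = "https://www.bbc.co.uk/food/recipes/"


def recipe_urls(xml_bytes, prefix=RECIPE_PREFIX):
    """Return the text of every <url>/<loc> element that starts with `prefix`."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for loc in root.findall("doc:url/doc:loc", namespaces=SITEMAP_NS):
        text = (loc.text or "").strip()
        if text.startswith(prefix):
            urls.append(text)
    return urls


# Hypothetical three-entry sample, used here instead of the real download
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.bbc.co.uk/food/recipes</loc></url>
  <url><loc>https://www.bbc.co.uk/food/chefs</loc></url>
  <url><loc>https://www.bbc.co.uk/food/recipes/yoghurt_cake_87253</loc><lastmod>2020-03-31</lastmod></url>
</urlset>"""

print(recipe_urls(sample))
# ['https://www.bbc.co.uk/food/recipes/yoghurt_cake_87253']
```

With the real sitemap, passing the bytes returned by url.read() to the helper should give the filtered list directly. If the image URLs are needed as well, the same startswith check can be applied to the 'loc' value of each dictionary built above.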