I want to use the sitemap at https://www.bbc.co.uk/food/sitemap.xml to get a list of recipe URLs in Python. I tried xmltodict, but as far as I can see, it does not convert the XML into a form I can easily work with. My code:
import urllib.request
import xmltodict

with urllib.request.urlopen('https://www.bbc.co.uk/food/sitemap.xml') as url:
    data = url.read()

data = xmltodict.parse(data)
print(data)
And part of the result:
[...] OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurtspicedchicken_74830'), ('lastmod', '2012-06-07'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yoghurtspicedchicken_74830_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurt_and_muesli_61842'), ('lastmod', '2018-04-18')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yoghurt_cake_87253'), ('lastmod', '2020-03-31'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yoghurt_cake_87253_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirecurdpie_86473'), ('lastmod', '2019-05-23')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshireparkin_83745'), ('lastmod', '2019-01-02')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepotwithchri_87677'), ('lastmod', '2018-12-03')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingswit_92145'), ('lastmod', '2016-09-13')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddings_86010'), ('lastmod', '2018-08-08'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepuddings_86010_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingviap_9974'), ('lastmod', '2015-12-07')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepuddingwith_83703'), ('lastmod', '2018-10-30')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepudding_81824'), ('lastmod', '2019-01-21'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepudding_81824_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshirepudding_93848'), ('lastmod', '2018-08-08'), ('image:image', OrderedDict([('image:loc', 
'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshirepudding_93848_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_20002'), ('lastmod', '2019-01-03'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_curd_tart_20002_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_23874'), ('lastmod', '2019-01-03')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_curd_tart_63644'), ('lastmod', '2016-09-19')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_oatmeal_parkin_13911'), ('lastmod', '2016-09-19')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_61798'), ('lastmod', '2018-11-28'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_puddings_61798_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_and_40867'), ('lastmod', '2018-12-04')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_with_15870'), ('lastmod', '2018-04-30')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_puddings_with_50889'), ('lastmod', '2019-02-11')]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_pudding_69240'), ('lastmod', '2019-12-10'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_pudding_69240_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_pudding_wraps_73052'), ('lastmod', '2019-09-30'), ('image:image', OrderedDict([('image:loc', 'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_pudding_wraps_73052_16x9.jpg')]))]), OrderedDict([('loc', 'https://www.bbc.co.uk/food/recipes/yorkshire_tapas_puddings_93245'), ('lastmod', '2016-09-14'), ('image:image', OrderedDict([('image:loc', 
'https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yorkshire_tapas_puddings_93245_16x9.jpg')]))]), [...]
I want to get only the URLs contained in the <loc> tags of the XML, and keep only those that start with "https://www.bbc.co.uk/food/recipes/".
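To make the goal concrete, here is a minimal sketch of the result I am after. It uses the standard library's xml.etree.ElementTree rather than xmltodict, and it assumes the sitemap declares the standard sitemaps.org namespace. The names extract_recipe_urls, SITEMAP_NS, RECIPE_PREFIX and SAMPLE are only illustrative, and the sample XML is made up so that the snippet runs without a network request:

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RECIPE_PREFIX = "https://www.bbc.co.uk/food/recipes/"


def extract_recipe_urls(xml_data, prefix=RECIPE_PREFIX):
    """Return the text of every sitemap <loc> element that starts with prefix."""
    root = ET.fromstring(xml_data)
    urls = []
    # Only <loc> in the sitemap namespace is matched, so <image:loc> is skipped.
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        text = (loc.text or "").strip()
        if text.startswith(prefix):
            urls.append(text)
    return urls


# Made-up sample in the same shape as the output above (the /chefs/ URL is hypothetical).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.bbc.co.uk/food/recipes/yoghurt_cake_87253</loc>
    <lastmod>2020-03-31</lastmod>
    <image:image>
      <image:loc>https://ichef.bbci.co.uk/food/ic/food_16x9_320/recipes/yoghurt_cake_87253_16x9.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://www.bbc.co.uk/food/chefs/example</loc>
  </url>
  <url>
    <loc>https://www.bbc.co.uk/food/recipes/yorkshireparkin_83745</loc>
    <lastmod>2019-01-02</lastmod>
  </url>
</urlset>
"""

if __name__ == "__main__":
    for recipe_url in extract_recipe_urls(SAMPLE):
        print(recipe_url)
```

With the real sitemap, the bytes returned by url.read() in my code would be passed in place of SAMPLE.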