I am trying to save scraped data to an XML file. The data comes from a website whose reviews I want to collect. Each page always has five reviews, which I want to write to a file in XML format. The problem: when I print the XML tree with print(ET.tostring(root, encoding='utf8').decode('utf8')), I see all five reviews that I want. But when I save them to the file with tree.write("test.xml", encoding='unicode'), the file contains only one review. Here is my code:

import requests
from bs4 import BeautifulSoup
import re
import json
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text

soup = BeautifulSoup(source, 'lxml')
pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
script = soup.find("script", text=pattern)
dictData = pattern.search(script.text).group(1)
jsonData = json.loads(dictData)

def get_countrycitydata():

    countrycity_dict = dict()

    country_data = jsonData['urqlCache']['3960485871']['data']['locations']
    for data in country_data:
        data1 = data['parents']
        countrycity_dict["country_name"] = data1[2]['name']
        countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
        countrycity_dict["city_name"] = data1[0]['name']

    return countrycity_dict

def get_hoteldata():

    hotel_dict = dict()

    locations = jsonData['urqlCache']['669061039']['data']['locations']
    for data in locations:
        hotel_dict["tripadvisorid_hotel"] = data['locationId']
        hotel_dict["hotel_name"] = data['name']

    return hotel_dict

def get_reviews():  

    all_dictionaries = []

    for locations in jsonData['urqlCache']['669061039']['data']['locations']:
        for reviews in locations['reviewListPage']['reviews']:

            review_dict = {}

            review_dict["reviewid"] = reviews['id']
            review_dict["reviewurl"] =  reviews['absoluteUrl']
            review_dict["reviewlang"] = reviews['language']
            review_dict["reviewtitle"] = reviews['title']
            reviewtext = reviews['text']
            clean_reviewtext = reviewtext.replace('\n', ' ')
            review_dict["reviewtext"] = clean_reviewtext

            all_dictionaries.append(review_dict)

    return all_dictionaries

def xml_tree(new_dict): # should I change something here???

    root = ET.Element("countries")
    country = ET.SubElement(root, "country")

    ET.SubElement(country, "name").text = new_dict["country_name"]
    city = ET.SubElement(country, "city")

    ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = new_dict["city_name"]
    hotels = ET.SubElement(city, "hotels")

    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")

    review = ET.SubElement(reviews, "review")
    ET.SubElement(review, "reviewid").text = str(new_dict["reviewid"])
    ET.SubElement(review, "reviewurl").text = new_dict["reviewurl"]
    ET.SubElement(review, "reviewlang").text = new_dict["reviewlang"]
    ET.SubElement(review, "reviewtitle").text = new_dict["reviewtitle"]
    ET.SubElement(review, "reviewtext").text = new_dict["reviewtext"]

    tree = ET.ElementTree(root)
    tree.write("test.xml", encoding='unicode')  

    print(ET.tostring(root, encoding='utf8').decode('utf8'))

##########################################################  

def main():

    city_dict = get_countrycitydata()
    hotel_dict = get_hoteldata()
    review_list = get_reviews()

    for index in range(len(review_list)):
        new_dict = {**city_dict, **hotel_dict, **review_list[index]}

        xml_tree(new_dict)

if __name__ == "__main__":
    main()  

How can I change the XML tree so that all five reviews are saved in the file? The XML file should look like this:

<countries>
    <country>
        <name>Schweiz</name>
        <city>
            <tripadvisorid>188113</tripadvisorid>
            <name>Zürich</name>
            <hotels>
                <hotel>
                    <tripadvisorid>228146</tripadvisorid>
                    <name>Hotel Coronado</name>
                    <reviews>
                        <review>
                            <reviewid>672052111</reviewid> 
                            <reviewurl>https://www.tripadvisor.ch/ShowUserReviews-g188113-d228146-r672052111-Coronado Hotel-Zurich.html</reviewurl>
                            <reviewlang>de</reviewlang>
                            <reviewtitle>Optimale Lage und Preis</reviewtitle>
                            <reviewtext>Hervorragendes Hotel.Beste Erfahrun mit Service und Zimme.Die Qalität der Betten ist optimalr. Zimmer sind trotz geringer Größe sehr gut ausgestattet.Der Föhn war in diesem Fall (nicht in früheren)etwas lahm</reviewtext>
                        </review>
                        <review>
                         second review here ...
                        </review>
                        <review>
                         third review here ...
                        </review>
                        ...
                    </reviews>
                </hotel>
            </hotels>
        </city>
    </country>
</countries>

Thank you in advance for all suggestions!

  • Probably need to append to your file, instead of overwriting it every time you open the file. It's no coincidence that the XML entry in your file is also the last entry in your XML tree. Commented Jan 6, 2020 at 14:36

1 Answer

Because xml_tree(new_dict) is called inside a for loop, tree.write() runs once per review. Each call overwrites the file, so only the last review is left in it.
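The overwrite is easy to reproduce in isolation. The following is a minimal sketch, not the asker's code: the review IDs and the temporary file path are made up for illustration. It writes one small tree per loop iteration to the same path and then reads the file back.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical review IDs, used only for this demonstration.
review_ids = ["101", "102", "103"]

# Temporary location so the demo does not touch any real file.
path = os.path.join(tempfile.mkdtemp(), "demo.xml")

for review_id in review_ids:
    root = ET.Element("reviews")
    ET.SubElement(root, "review").text = review_id
    # Passing a file name makes write() open the file in write mode,
    # which truncates whatever an earlier iteration stored there.
    ET.ElementTree(root).write(path, encoding="unicode")

with open(path, encoding="utf-8") as f:
    content = f.read()

# Only the review from the last iteration is left in the file.
print(content)
```

The print() calls in the question show every review because each iteration prints its own tree, while the file only ever holds the most recent one.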

Open your file in 'a' (append) mode with open():

with open('test.xml', 'a', encoding='utf-8') as f:
    tree.write(f, encoding='unicode')

See documentation here

1 Comment

Thank you for your answer! It worked, and now I have five reviews in the file, but they are not inside the <reviews>...</reviews> tag. Every call appends the whole XML tree to the file. How can I append only reviewid, reviewurl, reviewlang, reviewtitle and reviewtext inside the <reviews> tag?
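One way to get all reviews inside a single <reviews> element is to build the tree once, add one <review> per dictionary in a loop, and write the file a single time at the end. The sketch below is an assumption about how that could look, not a tested drop-in for the scraper: build_tree is a hypothetical helper name, and the review values are invented placeholders standing in for the output of get_reviews(). The dictionary keys and element names are the ones used in the question.

```python
import xml.etree.ElementTree as ET


def build_tree(city_dict, hotel_dict, review_list):
    """Build one <countries> tree holding every review of one hotel."""
    root = ET.Element("countries")
    country = ET.SubElement(root, "country")
    ET.SubElement(country, "name").text = city_dict["country_name"]

    city = ET.SubElement(country, "city")
    ET.SubElement(city, "tripadvisorid").text = str(city_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = city_dict["city_name"]

    hotels = ET.SubElement(city, "hotels")
    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(hotel_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = hotel_dict["hotel_name"]

    reviews = ET.SubElement(hotel, "reviews")
    # Only this part repeats: one <review> element per dictionary.
    for review_dict in review_list:
        review = ET.SubElement(reviews, "review")
        for key in ("reviewid", "reviewurl", "reviewlang", "reviewtitle", "reviewtext"):
            ET.SubElement(review, key).text = str(review_dict[key])

    return ET.ElementTree(root)


# Placeholder data standing in for get_countrycitydata(), get_hoteldata()
# and get_reviews(); the review values are invented for this sketch.
city_dict = {"country_name": "Schweiz", "tripadvisorid_city": 188113, "city_name": "Zürich"}
hotel_dict = {"tripadvisorid_hotel": 228146, "hotel_name": "Hotel Coronado"}
review_list = [
    {"reviewid": n, "reviewurl": "https://example.com/review/%d" % n,
     "reviewlang": "de", "reviewtitle": "Title %d" % n, "reviewtext": "Text %d" % n}
    for n in range(1, 6)
]

tree = build_tree(city_dict, hotel_dict, review_list)
# Written once, after the loop, so nothing is overwritten or duplicated.
tree.write("test.xml", encoding="utf-8", xml_declaration=True)
```

In main(), the for loop over review_list would then be replaced by a single call such as build_tree(city_dict, hotel_dict, review_list) followed by one write(). If indented output is wanted, ET.indent(tree) is available from Python 3.9 onward and can be called before writing.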
