How to extract data from XML with Python

Question

I'm trying to extract data from the XML (http://py4e-data.dr-chuck.net/comments_42.xml), which looks like:

<note>
<comments>
<comment>
<name>Romina</name>
<count>97</count>
</comment>
...

I need to count the number of <comment> tags, sum the values in their <count> tags, and print both results.

I tried to fetch and parse the XML based on the sample code I was given, but I also made some changes.

Please see my code:

import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

api_key = False

if api_key is False:
api_key = 42
serviceurl = 'http://py4e-data.dr-chuck.net/xml?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
address = input('Enter location: ')
if len(address) < 1: break

url = serviceurl + urllib.parse.urlencode(address)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()
print('Retrieved', len(data), 'characters')
tree = ET.fromstring(data)

count = 0
sum = 0
lst = tree.findall('comments/comment')
for item in lst:
    value = int(item.find('count'.text))
    count = count+1
    sum = sum + value
    print('Count:',count)
    print('Sum:',sum)

I expected to get the count and the sum of the values, but the terminal reported that "serviceurl" is invalid.

Can you give me a sample input for 'Enter location: '? Also, I think you meant to indent the two statements in your "while True:"
– jhelphenstine, Commented Jun 21, 2019 at 14:17
Oh yes, thanks for the heads-up! I forgot to give you the link and the expected result. Now it's solved :)
– Daisy CHEN, Commented Jun 22, 2019 at 9:21

jhelphenstine · Accepted Answer · 2019-06-21 14:31:13Z

1

I modified your code and achieved your goal of summing the values and reporting the count. I'm not sure this is the right answer, though, because I can't tell whether the 'Enter location' prompt and 'api_key' are inherited from the sample code or are something you specifically need.

Also, I assume you meant to use 'sum' instead of 'value' in your for loop, and store an increasing sum.

import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

api_key = False

if api_key is False:
        api_key = 42
        serviceurl = 'http://py4e-data.dr-chuck.net/comments_42.xml'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#while True:
#       address = input('Enter location: ')
#       if len(address) < 1: break

url = serviceurl #+ urllib.parse.urlencode(address)
uh = urllib.request.urlopen(url, context=ctx)

data = uh.read()
print('Retrieved', len(data), 'characters')
tree = ET.fromstring(data)

count = 0
sum = 0
lst = tree.findall('comments/comment')
for item in lst:
    sum = sum + int(item.find('count').text)
    count = count+1

print("Sum: ", sum, "Count: ", count)

I achieved the output:

Retrieved 4189 characters
Sum:  2553 Count:  50

I commented out some portions of your code to make it work -- are there other constraints that prohibit directly reading the data?
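To check the counting and summing logic without touching the network, a minimal sketch like the following can help. It parses an inline XML string instead of the downloaded data. The first <comment> entry comes from the question; the second entry, the SAMPLE_XML name, and the count_and_sum helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first <comment> is from the question,
# the second one is invented so there is more than one value to sum.
SAMPLE_XML = """<note>
  <comments>
    <comment><name>Romina</name><count>97</count></comment>
    <comment><name>Example</name><count>3</count></comment>
  </comments>
</note>"""


def count_and_sum(xml_text):
    """Return (number of <comment> elements, sum of their <count> values)."""
    root = ET.fromstring(xml_text)
    # The path is relative to the root <note> element.
    values = [int(c.findtext('count')) for c in root.findall('comments/comment')]
    return len(values), sum(values)


count, total = count_and_sum(SAMPLE_XML)
print('Count:', count)
print('Sum:', total)
```

Using a name such as total also avoids shadowing Python's built-in sum(), which the list-based version relies on.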

2 Comments

Daisy CHEN Over a year ago

Thank you so much! I solved it, and the issue was indeed about "serviceurl". I should have used input('Enter --') for "serviceurl" and pointed "url" at it. But if possible, may I ask what the "urllib.parse.urlencode(address)" after it does?

jhelphenstine Over a year ago

That's used to encode strings for use in URLs. The '%20' and '+' you see in long URLs are the result of two different ways of encoding, because you can't send plain-text spaces in URLs. There's additional reference material showing how urllib encodes parameters to build strings.
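As a small illustration of the two encodings mentioned here, the sketch below runs the same string through urllib.parse.quote, quote_plus, and urlencode. The input text and the 'address' parameter name are assumptions chosen for the example, not something taken from the course service.

```python
from urllib.parse import quote, quote_plus, urlencode

text = 'Ann Arbor, MI'  # hypothetical input containing spaces and a comma

# quote() percent-encodes the space as %20
print(quote(text))        # Ann%20Arbor%2C%20MI

# quote_plus() encodes the space as + (the HTML form style)
print(quote_plus(text))   # Ann+Arbor%2C+MI

# urlencode() takes a mapping (or a sequence of pairs) and uses quote_plus by default
print(urlencode({'address': text}))  # address=Ann+Arbor%2C+MI
```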

Andrew Chan · Accepted Answer · 2021-07-03 11:17:22Z

0

Try this instead. I entered the sample link http://py4e-data.dr-chuck.net/comments_42.xml and got the desired result of 2553.

import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

address = input('Enter location: ')

print('Retrieving', address)
uh = urllib.request.urlopen(address, context=ctx)

data = uh.read()
tree = ET.fromstring(data)

results = tree.findall('comments/comment')
print('Comment count:', len(results))
# Collect each <count> value as an integer, then add them up
counts = []
for item in results:
    counts.append(int(item.find('count').text))
print(counts)
print(sum(counts))

I removed the lines below and it worked. My guess is that the serviceurl is invalid. The code above works with the address alone, so the serviceurl is at least unnecessary.

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/xml?'

and

url = serviceurl + urllib.parse.urlencode(address)
uh = urllib.request.urlopen(url, context=ctx)
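Apart from the serviceurl itself, the removed line has a second problem that may be worth knowing about: urllib.parse.urlencode expects a mapping or a sequence of pairs, so passing the raw string from input() raises a TypeError. The sketch below shows this without making any network request. The build_url helper and the 'address' parameter name are hypothetical, and nothing here implies that the service accepts such a parameter.

```python
import urllib.parse

serviceurl = 'http://py4e-data.dr-chuck.net/xml?'
address = 'http://py4e-data.dr-chuck.net/comments_42.xml'


def build_url(base, params):
    """Hypothetical helper: append urlencoded params to a base URL."""
    return base + urllib.parse.urlencode(params)


# Passing a plain string, as in the removed line, is rejected by urlencode.
try:
    build_url(serviceurl, address)
except TypeError as err:
    print('urlencode rejected the string:', err)

# A dict is accepted; 'address' is an assumed parameter name.
print(build_url(serviceurl, {'address': address}))
```

Since the sample link already points straight at the XML file, opening it directly with urlopen, as in the code above, sidesteps both issues.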

edited Jul 3, 2021 at 11:17

answered Jul 2, 2021 at 5:37

1 Comment

user14887424 Over a year ago

Hello there, please fix the indentation in the second code block.
