Parse through an XML file in Python

Question

I want to parse the XML at http://charts.realclearpolitics.com/charts/1044.xml into a data frame with three columns: Date, Approve, and Disapprove. The file is dynamic: a new date is added each day, so the code should not depend on a fixed number of entries. My current solution is static, in that I loop over hard-coded index ranges of the value tags. I would like to learn how to implement it dynamically.

import numpy as np
import pandas as pd
import requests
from pattern import web

xml = requests.get('http://charts.realclearpolitics.com/charts/1044.xml').text
dom = web.Element(xml)
values = dom.by_tag('value')

date = []
approve = []
disapprove = []

# The last range number below is 1720 instead of 1727 because the last 7 values of the Approve and Disapprove tags are blank.
for i in range(0,1720):
    date.append(pd.to_datetime(values[i].content))

# The last range number below is 3447 instead of 3454 because the last 7 values are blank. Going up to 3454 raises an error when converting to float.
for i in range(1727,3447):
    a = float(values[i].content)
    approve.append(a)

# The last range number below is 5174 instead of 5181 because the last 7 values are blank.
for i in range(3454,5174):
    a = float(values[i].content)
    disapprove.append(a)

finalresult = pd.DataFrame({'date': date, 'Approve': approve, 'Disapprove': disapprove})
finalresult

lxml has XPath support, which seems to be what you want. You can then get the elements out with an XPath expression, no matter how many of them there are.
– Lennart Regebro, Commented Oct 13, 2013 at 8:04

mzjn · Accepted Answer · 2013-10-13 12:05:18Z

Here is one way to do it with lxml and XPath:

from lxml import etree
import pandas as pd

tree = etree.parse("http://charts.realclearpolitics.com/charts/1044.xml")

date = [s.text for s in tree.xpath("series/value")]
approve = [float(s.text) if s.text else 0.0
           for s in tree.xpath("graphs/graph[@title='Approve']/value")]
disapprove = [float(s.text) if s.text else 0.0
              for s in tree.xpath("graphs/graph[@title='Disapprove']/value")]

assert len(date) == len(approve) == len(disapprove)

finalresult = pd.DataFrame({'Date': date, 'Approve': approve, 'Disapprove': disapprove})
print(finalresult)

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1727 entries, 0 to 1726
Data columns (total 3 columns):
Date          1727  non-null values
Approve       1727  non-null values
Disapprove    1727  non-null values
dtypes: float64(2), object(1)

edited Oct 13, 2013 at 12:05

answered Oct 13, 2013 at 10:56

mzjn


1 Comment

PronojitS Over a year ago

Thanks for the code, it parses quite well. The result has 1720 non-null values, but it also contains 7 'None' values at the end. Doesn't that make operations like finalresult.Approve.sum() impossible?
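One possible way to handle the blank trailing values raised in this comment is to map empty elements to NaN instead of None or 0.0. The sketch below is not part of the original answer. It uses the standard-library ElementTree module rather than lxml, and it runs on a small inline sample instead of the live feed. The sample's root tag name, dates, and numbers are made up for illustration, and only the series/graphs/graph layout is taken from the XPath expressions in the answer. The helper names to_float and column are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Made-up sample: the layout follows the XPath expressions in the answer,
# but the root tag name, the dates and the numbers are assumptions.
SAMPLE = """<chart>
  <series>
    <value>day 1</value>
    <value>day 2</value>
    <value>day 3</value>
  </series>
  <graphs>
    <graph title="Approve">
      <value>60.5</value>
      <value>61.5</value>
      <value></value>
    </graph>
    <graph title="Disapprove">
      <value>20.0</value>
      <value>21.5</value>
      <value></value>
    </graph>
  </graphs>
</chart>"""


def to_float(text):
    """Return float(text), or NaN when the element is empty."""
    return float(text) if text and text.strip() else math.nan


def column(root, title):
    """Collect every <value> under the graph with the given title."""
    path = "graphs/graph[@title='{}']/value".format(title)
    return [to_float(v.text) for v in root.findall(path)]


root = ET.fromstring(SAMPLE)
date = [v.text for v in root.findall("series/value")]
approve = column(root, "Approve")
disapprove = column(root, "Disapprove")

# Sum only the values that are actually present.
approve_total = sum(x for x in approve if not math.isnan(x))
print(approve_total)
```

If these lists are passed to pd.DataFrame as in the answer, the Approve and Disapprove columns should come out as float columns with NaN for the blanks, and pandas skips NaN in sum() by default, so finalresult.Approve.sum() should work without treating missing days as zero.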
