I want to parse the XML at http://charts.realclearpolitics.com/charts/1044.xml and get the result in a data frame with 3 columns: Date, Approve, Disapprove. The XML file is dynamic in the sense that a new date is added each day, so the code should account for that. My current solution is static, i.e. I loop over hard-coded row numbers of the value tags. I would like to learn how to implement it dynamically.

import pandas as pd
import requests
from pattern import web

xml = requests.get('http://charts.realclearpolitics.com/charts/1044.xml').text
dom = web.Element(xml)
# All <value> tags in document order: dates first, then Approve, then Disapprove.
values = dom.by_tag('value')

date = []
approve = []
disapprove = []

# The range ends at 1720 instead of 1727 because the last 7 values of the Approve and Disapprove tags are blank.
for i in range(0, 1720):
    date.append(pd.to_datetime(values[i].content))

# The range ends at 3447 instead of 3454 because the last 7 values are blank. Going up to 3454 raises an error when converting to float.
for i in range(1727, 3447):
    approve.append(float(values[i].content))

# The range ends at 5174 instead of 5181 because the last 7 values are blank.
for i in range(3454, 5174):
    disapprove.append(float(values[i].content))

finalresult = pd.DataFrame({'date': date, 'Approve': approve, 'Disapprove': disapprove})
finalresult
lxml has XPath support, which seems to be what you want. You can then pull the elements out with an XPath query, no matter how many of them there are. Commented Oct 13, 2013 at 8:04
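
To illustrate the point of this comment without a network call, here is a minimal sketch that uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) instead of lxml. The inline sample document, its root element name `chart`, the attribute names, and the numbers are all assumptions; only the series/value and graphs/graph[@title=...]/value layout is taken from the XPath expressions in the answer below. The helper name `texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real feed is assumed to follow the same layout,
# but the root tag, attributes and numbers here are made up for illustration.
SAMPLE_XML = """
<chart>
  <series>
    <value xid="0">1/27/2009</value>
    <value xid="1">1/28/2009</value>
    <value xid="2">1/29/2009</value>
  </series>
  <graphs>
    <graph gid="1" title="Approve">
      <value xid="0">63.3</value>
      <value xid="1">63.5</value>
      <value xid="2"></value>
    </graph>
    <graph gid="2" title="Disapprove">
      <value xid="0">20.0</value>
      <value xid="1">19.3</value>
      <value xid="2"></value>
    </graph>
  </graphs>
</chart>
""".strip()


def texts(root, path):
    """Return the text of every element matching path (None for empty elements)."""
    return [element.text for element in root.findall(path)]


root = ET.fromstring(SAMPLE_XML)

# No hard-coded row numbers: each query returns however many <value> tags exist.
dates = texts(root, "series/value")
approve = texts(root, "graphs/graph[@title='Approve']/value")
disapprove = texts(root, "graphs/graph[@title='Disapprove']/value")

print(len(dates), len(approve), len(disapprove))
```

Because the queries select by position in the tree rather than by index, the lists grow automatically when a new day is appended to the file.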

1 Answer

Here is one way to do it with lxml and XPath:

from lxml import etree
import pandas as pd

tree = etree.parse("http://charts.realclearpolitics.com/charts/1044.xml")

date = [s.text for s in tree.xpath("series/value")]
approve = [float(s.text) if s.text else 0.0
           for s in tree.xpath("graphs/graph[@title='Approve']/value")]
disapprove = [float(s.text) if s.text else 0.0
              for s in tree.xpath("graphs/graph[@title='Disapprove']/value")]

assert len(date) == len(approve) == len(disapprove)

finalresult = pd.DataFrame({'Date': date, 'Approve': approve, 'Disapprove': disapprove})
print(finalresult)

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1727 entries, 0 to 1726
Data columns (total 3 columns):
Date          1727  non-null values
Approve       1727  non-null values
Disapprove    1727  non-null values
dtypes: float64(2), object(1)

1 Comment

Thanks for the code. It parses quite well. This has 1720 non-null values, but it also contains 7 'None' values at the end, which makes operations like finalresult.Approve.sum() impossible?
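
One possible way to handle the blank trailing values raised in this comment is to convert them to NaN rather than leaving them as None, since pandas skips NaN in reductions such as sum(). The sketch below is not part of the original answer; the three sample rows are made up and stand in for the text values the XPath queries would return, with None marking an empty value tag.

```python
import pandas as pd

# Hypothetical text values as they might come out of the XPath queries;
# None stands for an empty <value> element.
date = ["1/27/2009", "1/28/2009", "1/29/2009"]
approve = ["63.3", "63.5", None]
disapprove = ["20.0", "19.3", None]

finalresult = pd.DataFrame({
    # Parse the date strings into datetime64 values.
    "Date": pd.to_datetime(date),
    # errors="coerce" turns anything that is not a number (here: None) into NaN.
    "Approve": pd.to_numeric(pd.Series(approve), errors="coerce"),
    "Disapprove": pd.to_numeric(pd.Series(disapprove), errors="coerce"),
})

# NaN is skipped by default, so the sum covers only the rows that have data.
total_approve = finalresult["Approve"].sum()

# Alternatively, drop the incomplete rows altogether.
complete = finalresult.dropna()

print(finalresult.dtypes)
print(len(complete))
```

Whether to keep the NaN rows or drop them depends on whether the dates without poll numbers are still needed downstream.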
