3

I'm trying to extract the data on crime rates across states from this webpage: http://www.disastercenter.com/crime/uscrime.htm

I am able to get this into a text file, but I would like to get the response in JSON format. How can I do this in Python?

Here is my code:

import urllib
import re

from bs4 import BeautifulSoup

link = "http://www.disastercenter.com/crime/uscrime.htm"
f = urllib.urlopen(link)  # Python 2; on Python 3 this is urllib.request.urlopen
myfile = f.read()
soup = BeautifulSoup(myfile)
soup1 = soup.find('table', width="100%")
soup3 = str(soup1)
result = re.sub("<.*?>", "", soup3)  # strip the tags, leaving plain text
print(result)
output = open("output.txt", "w")
output.write(result)
output.close()
  • 2
    Your result is a long way from being json, what are you expecting as output? Commented May 14, 2015 at 21:42
  • 2
    Put the data in a useful Python data structure made of lists/dicts/strs/numbers, then use the json module. Commented May 14, 2015 at 21:48
  • @PadraicCunningham I am expecting the table contents in the form of JSON in a text file; even getting the table data into CSV would be great. Commented May 16, 2015 at 18:24
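As the second comment suggests, once the rows are in a list of dicts, the CSV the asker also mentions is a short step away with the standard csv module. A minimal sketch with made-up sample rows (the field names here are just a subset of the real table's columns):

```python
import csv
import io

# Hypothetical rows, shaped like the records the answers build below.
rows = [
    {"Year": "1960", "Population": "179323175", "Murder": "9110"},
    {"Year": "1961", "Population": "182992000", "Murder": "8740"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Year", "Population", "Murder"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Writing to a real file instead of io.StringIO works the same way: pass an open file object to csv.DictWriter.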

2 Answers

3

The following code gets the data from the two tables and outputs all of it as a JSON-formatted string.


Working Example (Python 2.7.9):

from lxml import html
import requests
import re as regular_expression
import json

page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = html.fromstring(page.text)

tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
          tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]

tabs = []

for table in tables:
    tab = []
    for row in table:
        for col in row:
            var = col.text_content()
            var = var.strip().replace(" ", "")
            var = var.split('\n')
            if regular_expression.match(r'^\d{4}$', var[0].strip()):
                tab_row = {}
                tab_row["Year"] = var[0].strip()
                tab_row["Population"] = var[1].strip()
                tab_row["Total"] = var[2].strip()
                tab_row["Violent"] = var[3].strip()
                tab_row["Property"] = var[4].strip()
                tab_row["Murder"] = var[5].strip()
                tab_row["Forcible_Rape"] = var[6].strip()
                tab_row["Robbery"] = var[7].strip()
                tab_row["Aggravated_Assault"] = var[8].strip()
                tab_row["Burglary"] = var[9].strip()
                tab_row["Larceny_Theft"] = var[10].strip()
                tab_row["Vehicle_Theft"] = var[11].strip()
                tab.append(tab_row)
    tabs.append(tab)

json_data = json.dumps(tabs)

output = open("output.txt", "w")
output.write(json_data)
output.close()
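To sanity-check what ends up in output.txt, the JSON can be round-tripped with json.load. A sketch using a small two-row stand-in for the tabs structure built above:

```python
import json

# Stand-in for the scraped structure: a list of tables,
# each a list of row dicts (values abbreviated here).
tabs = [[{"Year": "1960", "Total": "3384200"}],
        [{"Year": "1960", "Total": "1887.2"}]]

with open("output.txt", "w") as f:
    f.write(json.dumps(tabs))

with open("output.txt") as f:
    loaded = json.load(f)

print(loaded[0][0]["Year"])  # → 1960
```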

1 Comment

At the end of the webpage there are links to various states by name and by year. If I want these links in my JSON file as well, how do I extract them?
1

This might be what you want, if you can use the requests and lxml modules. The data structure presented here is very simple, adjust this to your needs.

First, get a response from your requested URL and parse the result into an HTML tree:

import requests        
from lxml import etree
import json

response = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = etree.HTML(response.text)

Assuming you want to extract both tables, create this XPath and unpack the results. totals is "Number of Crimes" and rates is "Rate of Crime per 100,000 People":

xpath = './/table[@width="100%"][@style="background-color: rgb(255, 255, 255);"]//tbody'
totals, rates = tree.findall(xpath)

Extract the raw data (td.find('./') returns the first child element, whatever tag it has) and clean the strings (the u'' prefix matters on Python 2.x, where a plain '\xa0' in a byte string would not match the non-breaking space character):

raw_data = []
for tbody in totals, rates:
    rows = []
    for tr in tbody.getchildren():
        row = []
        for td in tr.getchildren():
            child = td.find('./')
            if child is not None and child.tag != 'br':
                row.append(child.text.strip(u'\xa0').strip(u'\n').strip())
            else:
                row.append('')
        rows.append(row)
    raw_data.append(rows)

Zip together the table headers from the first two rows, then delete the redundant rows using extended slices with steps of 12 and 11:

data = {}
data['tags'] = [tag0 + tag1 for tag0, tag1 in zip(raw_data[0][0], raw_data[0][1])]

for raw in raw_data:
    del raw[::12]
    del raw[::11]
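The two del statements above remove every 12th item (starting at index 0), then every 11th item of what remains. A toy illustration with integers standing in for rows:

```python
rows = list(range(14))  # stand-in for one table's list of rows

del rows[::12]  # drops indices 0 and 12
del rows[::11]  # drops indices 0 and 11 of the shrunken list

print(rows)  # → [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

In the answer's data this strips the repeated header rows, since each header recurs at a fixed interval within the table.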

Store the rest of the raw data and create a JSON file (optionally, eliminate whitespace with separators=(',', ':')):

data['totals'], data['rates'] = raw_data[0], raw_data[1]
with open('data.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))
