3

I'm trying to extract the data on crime rates across states from this webpage: http://www.disastercenter.com/crime/uscrime.htm

I am able to get this into a text file, but I would like to get the response in JSON format. How can I do this in Python?

Here is my code:

import urllib
import re

from bs4 import BeautifulSoup

link = "http://www.disastercenter.com/crime/uscrime.htm"
f = urllib.urlopen(link)  # Python 2; on Python 3 this is urllib.request.urlopen
myfile = f.read()
soup = BeautifulSoup(myfile)
soup1 = soup.find('table', width="100%")
soup3 = str(soup1)
result = re.sub("<.*?>", "", soup3)  # strip the tags, leaving plain text
print(result)
output = open("output.txt", "w")
output.write(result)
output.close()
  • 2
    Your result is a long way from being json, what are you expecting as output? Commented May 14, 2015 at 21:42
  • 2
    Put the data in a useful Python data structure made of lists/dicts/strs/numbers, then use the json module. Commented May 14, 2015 at 21:48
  • @PadraicCunningham I am expecting the table contents in the form of JSON in a text file; even getting the table data into CSV would be great. Commented May 16, 2015 at 18:24
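As the second comment suggests, once the rows are in a list of dicts, the CSV the asker also mentions is a short step away with the standard csv module. A minimal sketch with made-up sample rows (the field names here are just a subset of the real table's columns):

```python
import csv
import io

# Hypothetical rows, shaped like the records the answers build below.
rows = [
    {"Year": "1960", "Population": "179323175", "Murder": "9110"},
    {"Year": "1961", "Population": "182992000", "Murder": "8740"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Year", "Population", "Murder"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Writing to a real file instead of io.StringIO works the same way: pass an open file object to csv.DictWriter.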

2 Answers

3

The following code gets the data from the two tables and outputs all of it as a JSON-formatted string.


Working Example (Python 2.7.9):

from lxml import html
import requests
import re as regular_expression
import json

page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = html.fromstring(page.text)

tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
          tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]

tabs = []

for table in tables:
    tab = []
    for row in table:
        for col in row:
            var = col.text_content()
            var = var.strip().replace(" ", "")
            var = var.split('\n')
            if regular_expression.match(r'^\d{4}$', var[0].strip()):
                tab_row = {}
                tab_row["Year"] = var[0].strip()
                tab_row["Population"] = var[1].strip()
                tab_row["Total"] = var[2].strip()
                tab_row["Violent"] = var[3].strip()
                tab_row["Property"] = var[4].strip()
                tab_row["Murder"] = var[5].strip()
                tab_row["Forcible_Rape"] = var[6].strip()
                tab_row["Robbery"] = var[7].strip()
                tab_row["Aggravated_Assault"] = var[8].strip()
                tab_row["Burglary"] = var[9].strip()
                tab_row["Larceny_Theft"] = var[10].strip()
                tab_row["Vehicle_Theft"] = var[11].strip()
                tab.append(tab_row)
    tabs.append(tab)

json_data = json.dumps(tabs)

output = open("output.txt", "w")
output.write(json_data)
output.close()
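To sanity-check what ends up in output.txt, the JSON can be round-tripped with json.load. A sketch using a small two-row stand-in for the tabs structure built above:

```python
import json

# Stand-in for the scraped structure: a list of tables,
# each a list of row dicts (values abbreviated here).
tabs = [[{"Year": "1960", "Total": "3384200"}],
        [{"Year": "1960", "Total": "1887.2"}]]

with open("output.txt", "w") as f:
    f.write(json.dumps(tabs))

with open("output.txt") as f:
    loaded = json.load(f)

print(loaded[0][0]["Year"])  # → 1960
```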

1 Comment

At the end of the webpage there are links to various states by name and by year. If I want these links in my JSON file as well, how do I extract them?
1

This might be what you want, if you can use the requests and lxml modules. The data structure presented here is very simple, adjust this to your needs.

First, get a response from your requested URL and parse the result into an HTML tree:

import requests        
from lxml import etree
import json

response = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = etree.HTML(response.text)

Assuming you want to extract both tables, create this XPath and unpack the results. totals is "Number of Crimes" and rates is "Rate of Crime per 100,000 People":

xpath = './/table[@width="100%"][@style="background-color: rgb(255, 255, 255);"]//tbody'
totals, rates = tree.findall(xpath)

Extract the raw data (td.find('./') returns the first child element, whatever tag it has) and clean the strings (the u'' prefix matters on Python 2.x, where a plain '\xa0' in a byte string would not match the non-breaking space character):

raw_data = []
for tbody in totals, rates:
    rows = []
    for tr in tbody.getchildren():
        row = []
        for td in tr.getchildren():
            child = td.find('./')
            if child is not None and child.tag != 'br':
                row.append(child.text.strip(u'\xa0').strip(u'\n').strip())
            else:
                row.append('')
        rows.append(row)
    raw_data.append(rows)

Zip together the table headers from the first two rows, then delete the redundant rows using extended slices with steps of 12 and 11:

data = {}
data['tags'] = [tag0 + tag1 for tag0, tag1 in zip(raw_data[0][0], raw_data[0][1])]

for raw in raw_data:
    del raw[::12]
    del raw[::11]
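The two del statements above remove every 12th item (starting at index 0), then every 11th item of what remains. A toy illustration with integers standing in for rows:

```python
rows = list(range(14))  # stand-in for one table's list of rows

del rows[::12]  # drops indices 0 and 12
del rows[::11]  # drops indices 0 and 11 of the shrunken list

print(rows)  # → [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

In the answer's data this strips the repeated header rows, since each header recurs at a fixed interval within the table.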

Store the rest of the raw data and create a JSON file (optionally, eliminate whitespace with separators=(',', ':')):

data['totals'], data['rates'] = raw_data[0], raw_data[1]
with open('data.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))
