I am fetching the HTML source of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful.

My code:

import json

import requests

url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)

ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'

The code above gives me a unicode object; however, when I paste that output into JSONLint it reports invalid JSON. Can somebody help me understand how I can convert the complete HTML into a JSON object?

  • Try using the jsonify() method from the Flask module. Commented Apr 18, 2017 at 10:18
  • The source at that URL does not return JSON. To fetch element values from HTML, you need to use something like BeautifulSoup or lxml. Commented Apr 18, 2017 at 10:27
  • You are doing some very strange things here. Why would you dump to JSON, then immediately load, and then build up a JSON string manually? Commented Apr 18, 2017 at 10:29
  • @SatishGarg I am using Beautiful Soup for further processing, but here I am trying to save the original HTML as well. Commented Apr 18, 2017 at 14:50
  • @DanielRoseman I am quite new to this, so I did not have much of an idea what I was doing; I was just trying to get it into JSON format. Commented Apr 18, 2017 at 14:50

2 Answers

jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation. jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
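The round trip described above can be checked with a short sketch. The `html` string here is a made-up stand-in for `htmlContent.text`, not the real page content:

```python
import json

# Hypothetical stand-in for htmlContent.text.
html = '<p class="x">He said "hi"</p>\n'

encoded = json.dumps(html)     # a JSON string literal: quoted and escaped
decoded = json.loads(encoded)  # parsing it yields the original text again

print(encoded)
print(decoded == html)
```

Only `encoded` is valid JSON; `decoded` is plain text again, so concatenating it into a hand-written JSON string puts unescaped quotes and newlines back into the output.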

Try to use json.dumps to generate your final JSON instead of building the JSON by hand:

ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
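As a rough illustration of why the hand-built string fails while `json.dumps` on a dict does not, here is a small sketch. The values of `urls`, `unique_id`, `final_date` and `page_html` are hypothetical stand-ins for the question's variables:

```python
import json

# Hypothetical stand-ins for the question's variables.
urls = "https://example.com/page"
unique_id = 42
final_date = "2017-04-18"
page_html = '<a href="x">link</a>\n<p>text</p>'

# Hand-built JSON: the quotes and newline inside page_html are not escaped.
manual = '{ "url" : "' + urls + '", "page_content" : "' + page_html + '" }'
try:
    json.loads(manual)
    manual_is_valid = True
except json.JSONDecodeError:
    manual_is_valid = False

# Letting json.dumps serialize the whole dict escapes everything that needs it.
document = json.dumps({
    "url": urls,
    "uid": str(unique_id),
    "page_content": page_html,
    "date": final_date,
})
round_tripped = json.loads(document)

print(manual_is_valid)
print(round_tripped["page_content"] == page_html)
```

The point is to build a normal Python dict first and serialize it once at the end, so the escaping is never done by hand.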

1 Comment

It worked like a charm. Thanks for improving my understanding as well. I clicked on accept answer, but have no idea why it is not working.

The correct way to convert HTML source code to a JSON file on the local system is as follows:

import json
import codecs

# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
    json_file = json.loads(jsonf.read())

# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
    html_file = htmlf.read()

# Choose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file

# Dump the dictionary to a JSON string and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json_object)
  • Next, open the JSON file in your editor.
  • Press CTRL + H and replace the \n and \t characters with '' (nothing).
  • Now you can read your HTML file with the codecs.open() function and carry out the operations.
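The same file-based flow can also be sketched with the built-in `open()` plus `json.load()` / `json.dump()`. This is a minimal variant rather than a drop-in replacement: the helper name `embed_html`, the flat `page_content` key and the sample file contents are assumptions made for the demo, which runs in a temporary directory.

```python
import json
import os
import tempfile


def embed_html(json_path, html_path, out_path, key="page_content"):
    """Store the text of html_path under `key` in the JSON document at json_path."""
    with open(json_path, encoding="utf-8") as jsonf:
        document = json.load(jsonf)
    with open(html_path, encoding="utf-8") as htmlf:
        document[key] = htmlf.read()
    with open(out_path, "w", encoding="utf-8") as outf:
        json.dump(document, outf, indent=4, ensure_ascii=False)
    return document


# Demo with throwaway files in a temporary directory.
sample_html = '<html>\n\t<body class="x">hello</body>\n</html>\n'
with tempfile.TemporaryDirectory() as workdir:
    json_path = os.path.join(workdir, "json_file.json")
    html_path = os.path.join(workdir, "html_file.html")
    out_path = os.path.join(workdir, "final_json_file.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"url": "https://example.com"}, f)
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(sample_html)
    embed_html(json_path, html_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored["page_content"] == sample_html)
```

The `\n` and `\t` sequences that appear in the saved file are JSON escapes, and `json.load()` turns them back into real newlines and tabs. Removing them in an editor is therefore optional, and it changes the stored HTML.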
