I've seen several examples online of how to convert HTML content to JSON, but I haven't been able to get a working result.

Suppose I have the following html_content:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>

As you can see, this contains a heading, paragraph and table elements. I am trying to convert the above to JSON and output the result to a separate file, with correct formatting. This is my code:

import sys
import json
jsonD = json.dumps(html_content, sort_keys=True, indent=4)

sys.stdout=open("output.json","w")
print (jsonD)
sys.stdout.close()

The result is:

"\n<html>\n\t<body>\n\t\t<h1>My Heading</h1>\n\t\t<p>Hello world</p>\n\t\t<table>\n\t\t\t<tr>\n\t\t\t\t<th>Name</th>\n\t\t\t\t<th>Age</th>\n\t\t\t\t<th>License</th>\n\t\t\t\t<th>Amount</th>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>John</td>\n\t\t\t\t<td>28</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>12.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Kevin</td>\n\t\t\t\t<td>25</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>22.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Smith</td>\n\t\t\t\t<td>38</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>52.20</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Stewart</td>\n\t\t\t\t<td>21</td>\n\t\t\t\t<td>N</td>\n\t\t\t\t<td>3.80</td>\n\t\t\t</tr>\n\t\t</table>\n\t</body>\n</html>\n"

As you can see, the newline and tab characters are written as literal \n and \t escape sequences, so the output is one long string. How can I fix this so that the output is correctly formatted JSON?

  • What output are you expecting? Commented Feb 19, 2020 at 16:04
  • This might be a helpful example to look at: xavierdupre.fr/blog/2013-10-27_nojs.html Commented Feb 19, 2020 at 16:10
  • @ZacharyBlackwood I've seen this example, but how do you import the HTMLtoJSONParser module? Commented Feb 19, 2020 at 16:13
  • @AlexW Similar to the output I've posted, but without the "\n" and "\t" between each element. Instead, it should actually break onto a new line or indent, as it is written in the HTML. Commented Feb 19, 2020 at 16:14
  • @Adam In the case of that blog post, he actually created the HTMLtoJSONParser, it's not something he imported from somewhere else Commented Feb 19, 2020 at 16:15
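
For readers wondering why the output is one long string: json.dumps serializes whatever Python object it is given, and html_content is a str, so the result is a single JSON string with the whitespace escaped. The indent argument only affects dicts and lists. The following is a minimal sketch of the difference; the "h1" and "p" keys and the write_json helper are illustrative assumptions, not part of the question.

```python
import json

html_content = "<h1>My Heading</h1>\n\t<p>Hello world</p>"

# A str is serialized as one JSON string, so the newline and tab
# are written as the escape sequences \n and \t.
as_string = json.dumps(html_content, indent=4)

# A dict (which you would normally build by parsing the HTML first)
# is serialized as a JSON object, and indent=4 then takes effect.
# The key names here are an illustrative choice, not a standard.
structured = {"h1": "My Heading", "p": "Hello world"}
as_object = json.dumps(structured, indent=4)


def write_json(data, path):
    """Hypothetical helper: write data as indented JSON without redirecting sys.stdout."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


print(as_string)
print(as_object)
```

So the HTML has to be parsed into dicts and lists before json.dumps can lay it out over multiple lines.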

2 Answers

You first need to decide what you want your JSON output to look like. If you want the names to be the keys and the values to be lists of everything else in the row, I would do something like:

from bs4 import BeautifulSoup
import json

html_content = """
<table>
    <tr>
        <td>John</td>
        <td>28</td>
        <td>Y</td>
        <td>12.30</td>
    </tr>
    <tr>
        <td>Kevin</td>
        <td>25</td>
        <td>Y</td>
        <td>22.30</td>
    </tr>
    <tr>
        <td>Smith</td>
        <td>38</td>
        <td>Y</td>
        <td>52.20</td>
    </tr>
    <tr>
        <td>Stewart</td>
        <td>21</td>
        <td>N</td>
        <td>3.80</td>
    </tr>
</table>
<h1> hello world </h1>
<table>
    <tr>
        <td>Jack</td>
        <td>1</td>
    </tr>
    <tr>
        <td>Joe</td>
        <td>2</td>
    </tr>
    <tr>
        <td>Bill</td>
        <td>3</td>
    </tr>
    <tr>
        <td>Sam</td>
        <td>4</td>
    </tr>
</table>
"""

html_content_parsed = [[cell.text for cell in row("td")]
                         for row in BeautifulSoup(html_content,features="html.parser")("tr")]

html_content_dictionary = {element[0]:element[1:] for element in html_content_parsed}

print(json.dumps(html_content_dictionary, indent=4))

As you can see, this ignores the non-table elements and puts the rows of every table into a single JSON object.

You can try out the program here: https://repl.it/@Mandawi/htmltojson
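
If you would rather keep each table separate than merge every row into one dictionary, one option is to group the rows by the table they belong to. Below is a rough sketch using only the standard library's html.parser, so it differs from the BeautifulSoup approach above. TableParser, tables_to_rows and tables_to_dicts are hypothetical names, and the sketch assumes well-formed, non-nested tables.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Hypothetical helper: collect each <table> as its own list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def tables_to_rows(html):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def tables_to_dicts(html):
    """Key each row by its first cell, keeping one dictionary per table."""
    return [{row[0]: row[1:] for row in table if row}
            for table in tables_to_rows(html)]


html = """
<table><tr><td>John</td><td>28</td></tr><tr><td>Kevin</td><td>25</td></tr></table>
<h1>hello world</h1>
<table><tr><td>Jack</td><td>1</td></tr></table>
"""
result = tables_to_dicts(html)
print(json.dumps(result, indent=4))
```

Each entry in the outer list then corresponds to one table, so rows from different tables never overwrite each other when they share a first cell.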

6 Comments

Thank you. I have seen the same response here: stackoverflow.com/a/59968204/3480297 but this doesn't work when there are multiple tables or elements other than "table" in the HTML. Do you know how to resolve that?
Yes, same idea!
What if there are multiple tables to the html_content? That only displays the first table for me.
I don't know what you mean by elements other than table. Do you want to put these elements in json as well? If you don't, then this will simply ignore them. If you do, then parse them the way you want them to look in json.
Sorry, I think the other elements as you mentioned can be formatted. But if there are multiple tables, the JSON outputs only the first table. Could you try that and see if the same happens to you?
There is a library for converting HTML to JSON here (full disclosure: I am the author of this library). It can convert a whole HTML document to JSON, and it also has a function that converts only the tables: you give it HTML, and it finds every table and converts each one to JSON.

For your specific use-case you can install the html-to-json library (see instructions here) and then run this:

import html_to_json
s = '''<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>'''

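import json

# convert_tables returns Python lists and dicts; dump them to see the JSON
tables = html_to_json.convert_tables(s)
print(json.dumps(tables, indent=2))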

As you can see in the output below, the html-to-json library uses the <th> elements (if available) as the keys for the output JSON:

[
  [
    {
      "Name": "John",
      "Age": "28",
      "License": "Y",
      "Amount": "12.30"
    },
    {
      "Name": "Kevin",
      "Age": "25",
      "License": "Y",
      "Amount": "22.30"
    },
    {
      "Name": "Smith",
      "Age": "38",
      "License": "Y",
      "Amount": "52.20"
    },
    {
      "Name": "Stewart",
      "Age": "21",
      "License": "N",
      "Amount": "3.80"
    }
  ]
]

If you want to convert the entire HTML document (and not just the tables), replace html_to_json.convert_tables(s) with html_to_json.convert(s).
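
To give a feel for what converting a whole document involves, here is a hand-rolled sketch that uses only the standard library. It is not how html-to-json is implemented and it does not produce that library's output format. The TreeBuilder class, the html_to_tree function and the "tag"/"children" layout are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: build a nested dict/list tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self._stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self._stack[-1]["children"].append(text)


def html_to_tree(html):
    """Parse html and return the nested tree rooted at a synthetic 'document' node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


sample = "<html><body><h1>My Heading</h1><p>Hello world</p></body></html>"
tree = html_to_tree(sample)
print(json.dumps(tree, indent=2))
```

Because the tree is made of dicts, lists and strings, json.dumps with indent produces a properly nested, multi-line result. Attributes are dropped in this sketch; keeping them would mean storing dict(attrs) on each node.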
