Python Convert HTML into JSON using Soup

Question

These are the rules

1. The HTML tags will start with any of the following: <p>, <ol> or <ul>
2. The content of the HTML, when any of the step 1 tags is found, will contain only the following tags: <em>, <strong> or <span style="text-decoration: underline;">
3. Map the step 2 tags as follows: <strong> becomes {"bold": True} in the JSON, <em> becomes {"italics": True}, and <span style="text-decoration: underline;"> becomes {"decoration": "underline"}
4. Any text found becomes {"text": "this is the text"} in the JSON
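
The tag-to-style mapping in these rules can be written down as a small lookup, independent of any HTML parser. This is only a sketch: it assumes the three inline tags are <strong>, <em> and <span style="text-decoration: underline;">, as the sample HTML further down suggests, and the names TAG_STYLES, style_for and text_item are hypothetical helpers, not part of the question.

```python
# Hypothetical lookup table for the inline tags; names are illustrative only.
TAG_STYLES = {
    "strong": {"bold": True},
    "em": {"italics": True},
}


def style_for(tag_name, style_attr=""):
    """Return the style fragment for one inline tag, or raise ValueError."""
    if tag_name in TAG_STYLES:
        return dict(TAG_STYLES[tag_name])
    # Assumption: an underlined span is recognised by its style attribute.
    if tag_name == "span" and "underline" in style_attr:
        return {"decoration": "underline"}
    raise ValueError(f"tag not covered by the rules: {tag_name!r}")


def text_item(text, *tag_specs):
    """Build one text item, merging the styles of every enclosing tag."""
    item = {"text": text}
    for tag_name, style_attr in tag_specs:
        item.update(style_for(tag_name, style_attr))
    return item


print(text_item("stephen",
                ("span", "text-decoration: underline;"),
                ("em", ""),
                ("strong", "")))
# {'text': 'stephen', 'decoration': 'underline', 'italics': True, 'bold': True}
```

Raising ValueError for any other tag keeps the "only the following tags" rule explicit instead of silently dropping unknown markup.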

Let’s say I have the HTML below. By using this:

from bs4 import BeautifulSoup as Soup  # assumed alias; the original snippet omits the import

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = soup.find_all(recursive=False)

Which produces this Array:

[
    <p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
    <p>2</p>,
    <p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
    <ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
]

By applying the rules above, this will be the result:

First Array element would be processed into this JSON:

{
    "text": [
        "The name is not mine it is for the people",
        {"text": "stephen", "decoration": "underline", "bold": True, "italics": True}, 
        {"text": "how can", "bold": True, "italics": True},
        {"text": "name", "italics": True},
        {"text": "good", "bold": True},
        {"text": "his name", "italics": True},
        {"text": "moneuet", "decoration": "underline"},
        {"text": "please ", "italics": True},
        {"text": "forever", "decoration": "underline", "bold":True},
        {"text": "tomorrow", "italics": True},
        {"text": "USA", "bold": True, "italics": True}
    ]
}

Second Array element would be processed into this JSON:

{"text": ["2"] }

Third Array element would be processed into this JSON:

{
    "text": [
        {"text": "moment", "bold": True},
        {"text": "Africa", "italics": True},
        {"text": "China", "italics": True},
        {"text": "home", "decoration": "underline"},
        {"text": "thomas", "italics": True},
        {"text": "nothing", "bold": True}
    ]
}

The fourth Array element would be processed into this JSON:

{
    "ol": [
        "first item", 
        {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
    ]
}

This is my attempt so far. I am able to drill down into the tags, but I don't know how to process the arrayOfTextAndStyles array:

from bs4 import BeautifulSoup as Soup  # assumed alias; the original snippet omits the import

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = soup.find_all(recursive=False)
for foundTag in allTags:
    foundTagStyles = foundTag.find_all(recursive=True)
    if len(foundTagStyles) > 0:
        if foundTag.name == "p":
            # direct text of the <p> itself, then the direct text of each nested tag
            arrayOfTextAndStyles = [
                {"tag": foundTag.name, "text": foundTag.find_all(text=True, recursive=False)}
            ] + [
                {"tag": tag.name, "text": tag.find_all(text=True, recursive=False)}
                for tag in foundTag.find_all()
            ]
        elif foundTag.name == "ol":
            pass  # not implemented yet
        elif foundTag.name == "ul":
            pass  # not implemented yet

You need to come up with a more consistent output format; why is the second paragraph not resulting in a list, while the others all do? Why doesn't the third paragraph have an initial text element before all the nested dictionaries?
– Martijn Pieters, Commented Sep 29, 2017 at 7:05
Alternatively, why not wrap all text in a dictionary? So for the first example, the first element would be {"text": "The name is not mine it is for the people"}.
– Martijn Pieters, Commented Sep 29, 2017 at 7:07
Where did 'can' go in your first example? How should <em><strong> how can</strong>name </em> be handled, really? It's a nested structure with text at two levels.
– Martijn Pieters, Commented Sep 29, 2017 at 7:45
There is also a space between 'good' and 'his name ...', followed by more nesting.
– Martijn Pieters, Commented Sep 29, 2017 at 7:46

Martijn Pieters · Accepted Answer · 2017-09-29 08:45:37Z

I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:

from bs4 import NavigableString

def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':') 
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result

You then parse your document:

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Using pprint, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
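
If a flat list of styled runs is still wanted, closer to the output sketched in the question, the nested structure can be flattened afterwards by pushing each parent's styles down to its text leaves. The following is a minimal sketch under that assumption; flatten is a hypothetical helper, not part of the answer's code, and it only handles dictionaries that carry a 'text' key (paragraph results, not the {'ol': ...} wrapper).

```python
def flatten(node, inherited=None):
    """Yield flat {'text': str, ...styles} runs from one nested node."""
    # Start from the styles of the enclosing nodes, then add this node's own.
    styles = dict(inherited or {})
    styles.update({key: value for key, value in node.items() if key != "text"})

    content = node["text"]
    if isinstance(content, str):
        # Leaf: a single string (NavigableString is a str subclass).
        yield {"text": content, **styles}
    else:
        # Branch: a list of child dictionaries that inherit these styles.
        for child in content:
            yield from flatten(child, styles)


nested = {
    "italics": True,
    "text": [{"bold": True, "text": " how can"}, {"text": "name "}],
}
for run in flatten(nested):
    print(run)
# {'text': ' how can', 'italics': True, 'bold': True}
# {'text': 'name ', 'italics': True}
```

Whitespace is kept exactly as parsed, so joining the 'text' values of the runs reproduces the original paragraph text.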

The structure is also reasonably consistent; a 'text' key will either point at a single string, or at a list of dictionaries. Such a list will never mix types. You could improve on this still: have 'text' only ever point to a string, and use a different key, such as 'contains' or 'nested', to signify nested data, so that each dictionary uses just one or the other. All that requires is changing the 'text' key in the len(elem) > 1 case of parse_text() and in the parse() function.
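
The same result can also be reached without touching the parser, by renaming the key in a pass over the finished dictionaries. This is a sketch of that alternative, not the change described above: split_keys is a hypothetical helper, the key name 'nested' is just one of the suggested options, and the function assumes every dictionary it visits has a 'text' key.

```python
def split_keys(node, nested_key="nested"):
    """Return a copy where 'text' is always a string and lists use nested_key."""
    # Copy every style key unchanged.
    out = {key: value for key, value in node.items() if key != "text"}

    content = node["text"]
    if isinstance(content, str):
        out["text"] = content
    else:
        # A list of child dictionaries moves under the separate key.
        out[nested_key] = [split_keys(child, nested_key) for child in content]
    return out


paragraph = {
    "text": [
        {"text": "his name "},
        {"decoration": "underline", "text": "moneuet"},
    ]
}
print(split_keys(paragraph))
# {'nested': [{'text': 'his name '}, {'decoration': 'underline', 'text': 'moneuet'}]}
```

A consumer can then test for the 'nested' key instead of checking the type of the 'text' value.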

edited Sep 29, 2017 at 8:45

answered Sep 29, 2017 at 8:09


6 Comments

Ernest Appiah Over a year ago

I am testing, and I will get back soon. Thanks for your help.

Ernest Appiah Over a year ago

Is it possible to convert the entire result into a valid JSON array, for example with json.dumps(result)? The result looks very promising; however, the final output is not in JSON format.

Martijn Pieters Over a year ago

@ErnestAppiah: I've updated the answer to fix a small bug in handling nested elements with multiple children.

Martijn Pieters Over a year ago

@ErnestAppiah: the final output is trivial to produce. Instead of print(result) in the last snippet (where I loop over soup.select('ol,p')), append the result to a list. Then use json.dumps(list_produced).

Ernest Appiah Over a year ago

Thanks a lot man. You are awesome. How can I give you 500 stars? Thanks so much. I really appreciate your help.
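
The suggestion in the comments above, collecting each result in a list and serialising the list once, might look like the sketch below. To stay self-contained it uses a hand-written stand-in, parsed_results, in place of the values that parse(candidate) would return for each element of soup.select('ol,p'); that stand-in and the name collected are assumptions for illustration.

```python
import json

# Stand-in for the dictionaries returned by parse(); in the real loop each
# one would come from parse(candidate) inside the try/except block.
parsed_results = [
    {"text": [{"text": "2"}]},
    {"ol": [
        {"text": "first item"},
        {"bold": True, "decoration": "underline",
         "italics": True, "text": "second item"},
    ]},
]

collected = []
for result in parsed_results:
    # Append instead of printing, so everything ends up in one array.
    collected.append(result)

# json.dumps writes Python's True as JSON's lowercase true.
document_json = json.dumps(collected, indent=2)
print(document_json)
```

The printed dictionaries in the answer use Python's repr (single quotes, True), which is why they are not valid JSON until they pass through json.dumps.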
