
These are the rules:

  1. The HTML will start with one of the following tags: <p>, <ol> or <ul>.
  2. The content of any tag from rule 1 will contain only the following tags: <em>, <strong> or <span style="text-decoration:underline">.
  3. The tags from rule 2 map as follows: <strong> becomes {"bold": True} in the JSON, <em> becomes {"italics": True}, and <span style="text-decoration:underline"> becomes {"decoration": "underline"}.
  4. Any text found becomes {"text": "this is the text"} in the JSON.

Let’s say I have some HTML and collect its top-level tags like this:

from bs4 import BeautifulSoup as Soup

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]

This produces the following array:

[
    <p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
    <p>2</p>,
    <p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
    <ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
]

Applying the rules above, this should be the result:

First Array element would be processed into this JSON:

{
    "text": [
        "The name is not mine it is for the people",
        {"text": "stephen", "decoration": "underline", "bold": True, "italics": True}, 
        {"text": "how can", "bold": True, "italics": True},
        {"text": "name", "italics": True},
        {"text": "good", "bold": True},
        {"text": "his name", "italics": True},
        {"text": "moneuet", "decoration": "underline"},
        {"text": "please ", "italics": True},
        {"text": "forever", "decoration": "underline", "bold":True},
        {"text": "tomorrow", "italics": True},
        {"text": "USA", "bold": True, "italics": True}
    ]
}

Second Array element would be processed into this JSON:

{"text": ["2"] }

Third Array element would be processed into this JSON:

{
    "text": [
        {"text": "moment", "bold": True},
        {"text": "Africa", "italics": True},
        {"text": "China", "italics": True},
        {"text": "home", "decoration": "underline"},
        {"text": "thomas", "italics": True},
        {"text": "nothing", "bold": True}
    ]
}

The fourth Array element would be processed into this JSON:

{
    "ol": [
        "first item", 
        {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
    ]
}

This is my attempt so far. I am able to drill down into the tags, but I don't know how to process the arrayOfTextAndStyles array:

from bs4 import BeautifulSoup as Soup

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = soup.find_all(recursive=False)
for foundTag in allTags:
    foundTagStyles = foundTag.find_all(recursive=True)
    if len(foundTagStyles) > 0:
        if foundTag.name == "p":
            # direct text of the <p> first, then one entry per nested tag
            arrayOfTextAndStyles = [
                {"tag": foundTag.name, "text": foundTag.find_all(text=True, recursive=False)}
            ] + [
                {"tag": tag.name, "text": tag.find_all(text=True, recursive=False)}
                for tag in foundTag.find_all()
            ]
        elif foundTag.name == "ol":
            pass  # not handled yet
        elif foundTag.name == "ul":
            pass  # not handled yet
  • You need to come up with a more consistent output format; why is the second paragraph not resulting in a list, while the others all do? Why doesn't the third paragraph have an initial text element before all the nested dictionaries? Commented Sep 29, 2017 at 7:05
  • Alternatively, why not wrap all text in a dictionary? So for the first example, the first element would be {"text": "The name is not mine it is for the people"}. Commented Sep 29, 2017 at 7:07
  • @MartijnPieters have edited output 2 Commented Sep 29, 2017 at 7:07
  • Where did can go in your first example? How should <em><strong> how can</strong>name </em> be handled, really? It's a nested structure with text at two levels. Commented Sep 29, 2017 at 7:45
  • There is also a space between '<strong>good</strong>' and <em>his name ..., followed by more nesting. Commented Sep 29, 2017 at 7:46

1 Answer

I'd use a function to parse each element rather than one huge loop. Select the p and ol tags, and raise an exception while parsing to flag anything that doesn't match your specific rules:

from bs4 import NavigableString

def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':') 
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result

You then parse your document:

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Printing with pprint instead of print, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
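To illustrate the point about re-creating the structure, here is a minimal sketch that turns one parsed paragraph back into markup. The names to_html and paragraph_to_html are hypothetical, not part of the code above. One assumption to be aware of: when several wrappers were merged into a single dictionary, their original order is not recorded, so the sketch always nests them as span, then em, then strong. The result is visually equivalent but not necessarily byte-identical to the input.

```python
from html import escape


def to_html(node):
    """Rebuild markup from one dict shaped like parse_text() output (hypothetical helper)."""
    content = node['text']
    if isinstance(content, list):
        # nested children: render each one and join them in order
        inner = ''.join(to_html(child) for child in content)
    else:
        inner = escape(str(content))
    # Wrap innermost-first; the fixed strong/em/span order is an assumption.
    if node.get('bold'):
        inner = '<strong>' + inner + '</strong>'
    if node.get('italics'):
        inner = '<em>' + inner + '</em>'
    if node.get('decoration') == 'underline':
        inner = '<span style="text-decoration: underline;">' + inner + '</span>'
    return inner


def paragraph_to_html(parsed):
    """Render a {'text': [...]} dict, as returned by parse() for a <p> tag."""
    return '<p>' + ''.join(to_html(node) for node in parsed['text']) + '</p>'


sample = {'text': [{'bold': True, 'text': 'moment'},
                   {'italics': True, 'text': 'Africa'},
                   {'text': ' '},
                   {'italics': True,
                    'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]}]}
print(paragraph_to_html(sample))
```

Because the whitespace-only text nodes are kept as their own entries, the spaces between inline elements survive the round trip.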

The structure is also reasonably consistent; a 'text' key will either point at a single string, or at a list of dictionaries. Such a list will never mix types. You could still improve on this: have 'text' only ever point to a string, and use a different key to signify nested data, such as contains or nested or similar, so that each dictionary has just one or the other. All that would require is changing the 'text' key in the len(elem) > 1 branch of parse_text() and in the parse() function.
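As a rough sketch of that alternative shape, the function below post-processes an already parsed result instead of editing parse() and parse_text() directly, which is a different route to the same output. The name split_keys and the choice of 'contains' as the key are assumptions made for illustration.

```python
def split_keys(node):
    """Return a copy in which nested lists live under 'contains', not 'text'.

    Hypothetical helper: accepts a dict from parse() or parse_text(), or a list of them.
    """
    if isinstance(node, list):
        return [split_keys(item) for item in node]
    if 'ol' in node:
        # list items are converted one by one; the 'ol' key itself is kept
        return {'ol': split_keys(node['ol'])}
    # copy the style flags, then decide where the content goes
    out = {key: value for key, value in node.items() if key != 'text'}
    content = node['text']
    if isinstance(content, list):
        out['contains'] = split_keys(content)
    else:
        out['text'] = content
    return out


nested = {'italics': True,
          'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]}
print(split_keys(nested))
```

With this shape a consumer can check for 'contains' to decide whether to recurse, and 'text' is always a plain string.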


6 Comments

I am testing, and I will get back soon. Thanks for your help
Is it possible to convert the entire result into a valid JSON array, e.g. with json.dumps(result), for the final result? The result looks very promising; however, the final output is not in JSON format.
@ErnestAppiah: I've updated the answer to fix a small bug in handling nested elements with multiple children.
@ErnestAppiah: the final output is trivial to produce. Instead of print(result) in the last snippet (where I loop over soup.select('ol,p')), append the result to a list. Then use json.dumps(list_produced).
Thanks a lot man. You are awesome. How can I give you 500 stars? Thanks so much. I really appreciate your help
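The json.dumps suggestion from the comments can be sketched as follows. The two literal dictionaries are stand-ins for what parse() would return; in the real loop each result would come from parse(candidate) for the elements in soup.select('ol,p').

```python
import json

# Stand-ins for parse() results; assumed sample data, not real parser output.
parsed_elements = [
    {'text': [{'bold': True, 'text': 'moment'}, {'text': ' '}]},
    {'ol': [{'text': 'first item'}]},
]

results = []
for result in parsed_elements:
    # in the real loop: result = parse(candidate), skipping on ValueError
    results.append(result)

document = json.dumps(results)
print(document)
```

Python's True is serialized as JSON true, so the style flags need no special handling.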