
Using Python requests and Beautiful Soup, how can I select the correct HTML block when the response may contain multiple blocks (or delete the ones I don't want)?

import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())

The first time this script is run against a target, r.text contains:

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

And the script returns (unintended):

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->

If the script is called subsequently, the first block is absent and the interesting content is output; r.text looks like this:

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

And the script returns (as intended):

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

If the target hasn't been queried before, both blocks are present in r.text. It seems that beautifulsoup only handles the first block it finds.

I would like the code to work no matter whether the first block is present or not. How can I test r.text for multiple html blocks, select the appropriate one, and pass it to beautifulsoup?

I am currently investigating using re.sub to delete anything prior to <!-- cgi_interesting --> but is there a better way?

  • You say that you get the html blocks on different calls, then you say they are in the same r.text. So..., which? Commented Apr 12, 2018 at 3:42
  • @tdelaney post edited for clarity I hope Commented Apr 12, 2018 at 4:29
  • Can you share the url? Commented Apr 12, 2018 at 4:33
  • @KeyurPotdar no sorry, it's on an internal network. Commented Apr 12, 2018 at 4:47

2 Answers


That html is more invalid than beautifulsoup can deal with. Give a hand to whoever wrote such a buggy site! You could slice up the buffer at </html> boundaries and use soup multiple times:

import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
content = r.content

html_blocks = []

# save declarations for all blocks
html_index = content.find(b'<html>')
if html_index >= 0:
    decl = content[:html_index]
    # bytes objects are immutable, so re-slice instead of using del
    content = content[html_index:]

    # find html extents
    while content:

        # find end tag
        extent = content.find(b'</html>')
        if extent >= 0:
            extent += len(b'</html>')
        else:
            # no end tag, hope BS figures it out
            extent = len(content)

        # put in list and drop from input
        html_blocks.append(decl + content[:extent])
        content = content[extent:]

        # advance to next html tag
        html_index = content.find(b'<html>')
        if html_index == -1:
            html_index = len(content)
        content = content[html_index:]


for block in html_blocks:
    soup = BeautifulSoup(block, "lxml")
    print(soup.prettify())
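
If only one of the blocks is wanted, the same slicing idea can be written more compactly with a regular expression. The following is a minimal sketch, not part of the original answer: the helper names `split_html_blocks` and `pick_block_with_body` are hypothetical, and it assumes the interesting block is the one that contains a `<body>` tag, as in the sample response in the question.

```python
import re

# Assumption: blocks are delimited by <html ...> and </html>, as in the
# sample response. DOTALL lets "." span the newlines inside a block.
HTML_BLOCK = re.compile(r"<html\b[^>]*>.*?</html\s*>", re.DOTALL | re.IGNORECASE)


def split_html_blocks(text):
    """Return every <html>...</html> block found in text, in document order."""
    return HTML_BLOCK.findall(text)


def pick_block_with_body(blocks):
    """Return the first block that contains a <body> tag, or None if none does."""
    for block in blocks:
        if re.search(r"<body\b", block, re.IGNORECASE):
            return block
    return None
```

The selected string could then be passed to BeautifulSoup in place of r.text. Anything outside the html tags, such as the marker comment, is dropped by this approach.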

5 Comments

I was looking at using re.sub on the <!-- cgi_interesting --> line as then I don't have to choose which block to process. But I'm having issues with that. I can use your solution to do the same thing. PS. it's an appliance; buggy site runs on flaky tcp stack!
Runtime error at the first instance of del content[:html_index] "TypeError: 'bytes' object does not support item deletion". I tried with r.text for a similar error. Sounds odd, have I done something wrong?
That was me not checking whether it works. Do content = content[html_index:] for an explicit copy.
A split should work too: r.text.split("<!-- cgi_interesting -->").
The problem with trying to get them all into a single doc is that there are multiple head and body tags as well. Once you've built the two html documents, you can then select the body of the second and copy its child elements into the body of the first.
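
The split suggested in the comments can be sketched with str.partition, which behaves the same whether or not the first block is present. This is only an illustration under the assumption that the marker comment from the question always precedes the interesting block; the helper name `keep_from_marker` is hypothetical.

```python
MARKER = "<!-- cgi_interesting -->"


def keep_from_marker(text, marker=MARKER):
    """Drop everything before the first marker; return text unchanged if it is absent."""
    _before, found, after = text.partition(marker)
    if not found:
        # partition returns an empty separator when the marker is missing
        return text
    return found + after
```

The result could be passed to BeautifulSoup in place of r.text, for example BeautifulSoup(keep_from_marker(r.text), "lxml").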

Rather than keeping each block of HTML, I used re.sub to strip everything before the HTML comment, since it wasn't needed. This has now run successfully against more than 60 sites.

import re
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
# remove everything before the first marker comment; text without the marker is left unchanged
result = re.sub(r".*?(<!-- cgi_interesting -->)", r"\1", r.text, count=1, flags=re.DOTALL)
soup = BeautifulSoup(result, "lxml")
print(soup.prettify())
