How to select the correct html block in python requests response

Question

Using python requests and beautiful soup, how can I select the correct html block if multiple blocks may be returned in the response (or delete what I don't want)?

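import requests
from bs4 import BeautifulSoup

# my_url is the base URL of the target, defined elsewhere
url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())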

The first time this script is run against a target, the content of r.text is:

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

And the script returns (unintended):

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->

If the script is called subsequently, the first block is absent and the interesting content is output; r.text looks like this:

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

And the script returns (as intended):

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

If the target hasn't been queried before, both blocks are present in r.text. It seems that beautifulsoup only handles the first block it finds.

I would like the code to work no matter whether the first block is present or not. How can I test r.text for multiple html blocks, select the appropriate one, and pass it to beautifulsoup?

I am currently investigating using re.sub to delete anything prior to <!-- cgi_interesting -->, but is there a better way?

You say that you get the html blocks on different calls, then you say they are in the same r.text. So..., which?
– tdelaney, Commented Apr 12, 2018 at 3:42

tdelaney · Accepted Answer · 2018-04-12 04:56:38Z

2

That HTML is more invalid than BeautifulSoup can deal with. Give a hand to whoever wrote such a buggy site! You could slice up the buffer at </html> boundaries and run BeautifulSoup on each piece:

import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
content = r.content

html_blocks = []

# save declarations for all blocks
html_index = content.find(b'<html>')
if html_index >= 0:
    decl = content[:html_index]
    # bytes is immutable, so rebind to a slice instead of using del
    content = content[html_index:]

    # find html extents
    while content:

        # find end tag
        extent = content.find(b'</html>')
        if extent >= 0:
            extent += len(b'</html>')
        else:
            # no end tag, hope BS figures it out
            extent = len(content)

        # put in list and remove from input
        html_blocks.append(decl + content[:extent])
        content = content[extent:]

        # advance to next html tag
        html_index = content.find(b'<html>')
        if html_index == -1:
            html_index = len(content)
        content = content[html_index:]


for block in html_blocks:
    soup = BeautifulSoup(block, "lxml")
    print(soup.prettify())

answered Apr 12, 2018 at 4:56

tdelaney


5 Comments

gloopy Over a year ago

I was looking at using re.sub on the <!-- cgi_interesting --> line, as then I don't have to choose which block to process. But I'm having issues with that. I can use your solution to do the same thing. PS: it's an appliance; the buggy site runs on a flaky TCP stack!

gloopy Over a year ago

Runtime error at the first instance of del content[:html_index]: "TypeError: 'bytes' object does not support item deletion". I tried with r.text and got a similar error. Sounds odd; have I done something wrong?

tdelaney Over a year ago

That was me not checking whether it works. Do content = content[html_index:] for an explicit copy.

tdelaney Over a year ago

A split should work too: r.text.split("").
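
The split idea in the comment above can be sketched as follows. The separator did not survive in the comment text, so this sketch assumes it is the closing </html> tag; split_html_blocks is a hypothetical helper name, not part of requests or BeautifulSoup, and the match is case-sensitive.

```python
def split_html_blocks(text, end_tag="</html>"):
    """Split a response body into one string per <html>...</html> block.

    Assumption: blocks are delimited by the literal closing tag `end_tag`.
    """
    blocks = []
    for piece in text.split(end_tag):
        start = piece.find("<html")
        if start == -1:
            # trailing whitespace or a stray comment, not a document
            continue
        # drop anything before <html and re-append the tag that split() removed
        blocks.append(piece[start:] + end_tag)
    return blocks


sample = (
    "<html><head><script>nothing to see here</script></head></html>\n"
    "<!-- cgi_interesting -->\n"
    "<html><head></head><body>interesting content</body></html>\n"
)
blocks = split_html_blocks(sample)
```

Since the interesting block comes last in both responses shown in the question, blocks[-1] would be the piece to hand to BeautifulSoup.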

tdelaney Over a year ago

The problem with trying to get them all into a single doc is that there are multiple head and body tags as well. Once you've built the two html documents, you can then select the body of the second and copy its child elements into the body of the first.
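
The merge described in the comment above might look roughly like this. It is a sketch under assumptions: merge_bodies is a hypothetical helper name, and it defaults to Python's built-in "html.parser" rather than the "lxml" parser used elsewhere in this thread, so the tree it builds may differ slightly.

```python
from bs4 import BeautifulSoup


def merge_bodies(first_html, second_html, parser="html.parser"):
    """Move the children of the second document's <body> into the first's."""
    first = BeautifulSoup(first_html, parser)
    second = BeautifulSoup(second_html, parser)
    if second.body is None:
        return first  # nothing to copy
    if first.body is None:
        # the first block in the question has no <body>, so create one
        root = first.html if first.html is not None else first
        root.append(first.new_tag("body"))
    # iterate over a copy because extract() mutates second.body.contents
    for child in list(second.body.contents):
        first.body.append(child.extract())
    return first


merged = merge_bodies(
    "<html><head><title>one</title></head></html>",
    "<html><head></head><body><p>interesting content</p></body></html>",
)
```

The head of the second document is discarded here; if its meta or link tags matter, they would need to be copied in the same way.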

gloopy · Accepted Answer · 2018-04-12 07:21:28Z

0

Rather than keeping each block of HTML, I used re.sub to remove everything before the HTML comment, since it wasn't needed. It ran successfully across more than 60 sites.

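import re

import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
# drop everything before the marker comment; text without the marker is left unchanged
result = re.sub(r".*?(<!-- cgi_interesting -->)", r"\1", r.text, count=1, flags=re.DOTALL)
soup = BeautifulSoup(result, "lxml")
#soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())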
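
The same trimming can also be done without a regular expression. This is a minimal sketch: keep_from_marker is a hypothetical helper name, and the default marker is the comment string from the responses in the question. It keeps everything from the marker onward and returns the text unchanged when the marker is absent, so it should behave the same whether or not the first block is present.

```python
def keep_from_marker(text, marker="<!-- cgi_interesting -->"):
    """Return `text` starting at the first `marker`, or unchanged if absent."""
    index = text.find(marker)
    if index == -1:
        # marker missing: leave the text untouched
        return text
    return text[index:]


first_run = (
    "<html><head></head></html>\n"
    "<!-- cgi_interesting -->\n"
    "<html><body>interesting content</body></html>"
)
later_run = (
    "<!-- cgi_interesting -->\n"
    "<html><body>interesting content</body></html>"
)
trimmed = keep_from_marker(first_run)
```

The result of keep_from_marker(r.text) could then be passed to BeautifulSoup in place of the re.sub output.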

answered Apr 12, 2018 at 7:21

gloopy
