Using Python requests and Beautiful Soup, how can I select the correct HTML block when the response may contain several (or delete the ones I don't want)?
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"  # my_url is defined elsewhere
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())
The first time this script is run against a target, r.text contains:
<html>
<head>
<script language="Javascript">
top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
nothing to see here
</script>
</head>
</html>
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
And the script returns (unintended):
<html>
<head>
<script language="Javascript">
top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
nothing to see here
</script>
</head>
</html>
<!-- cgi_interesting -->
If the script is called subsequently, the first block is absent and the interesting content is output; r.text looks like this:
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
And the script returns (as intended):
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
If the target hasn't been queried before, both blocks are present in r.text. It seems that BeautifulSoup only handles the first block it finds.
I would like the code to work regardless of whether the first block is present. How can I test r.text for multiple HTML blocks, select the appropriate one, and pass it to BeautifulSoup?
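To make the "test for multiple blocks and select one" idea concrete, here is a rough sketch. The helper names split_html_blocks and pick_block are made up for illustration, and it assumes the wanted block is the one that contains a <body> element and that <html> elements are never nested:

```python
import re

# Hypothetical helpers, not part of requests or Beautiful Soup.
# Matches each <html>...</html> document, case-insensitively, across newlines.
HTML_BLOCK = re.compile(r"(?is)<html\b.*?</html\s*>")

def split_html_blocks(text):
    """Return every <html>...</html> block found in text, in order."""
    return HTML_BLOCK.findall(text)

def pick_block(text):
    """Return the first block with a <body>; fall back to the whole text."""
    for block in split_html_blocks(text):
        if "<body" in block.lower():  # assumption: only the wanted block has a body
            return block
    return text
```

The result could then be parsed with BeautifulSoup(pick_block(r.text), "lxml"). Note that this drops the <!-- cgi_interesting --> comment, since it sits outside the <html> element.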
I am currently investigating using re.sub to delete anything prior to <!-- cgi_interesting --> but is there a better way?
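For reference, this is roughly what the marker-based trimming would look like, as a sketch only. The function names are made up, and it assumes the <!-- cgi_interesting --> comment always precedes the wanted block exactly as in the dumps above:

```python
import re

MARKER = "<!-- cgi_interesting -->"

def trim_with_partition(text, marker=MARKER):
    """Keep the marker and everything after it; return text unchanged if absent."""
    _, sep, tail = text.partition(marker)
    return sep + tail if sep else text

def trim_with_re(text, marker=MARKER):
    """Same result via re.sub: delete everything before the first marker."""
    # (?s) lets "." span newlines; the lookahead keeps the marker itself.
    pattern = r"(?s)\A.*?(?=" + re.escape(marker) + r")"
    return re.sub(pattern, "", text, count=1)
```

Either result could be passed on as BeautifulSoup(trim_with_partition(r.text), "lxml"). Both variants leave r.text untouched when the first block is absent, so the same call would work on the first and on subsequent requests.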