
I wanted to use Python web scraping to feed an ML application I wrote that produces a summary of summaries, to ease my daily research work. I seem to have run into some difficulties: while I have followed a lot of suggestions from the web, such as this one:
Python Selenium accessing HTML source, I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content', depending on the attempt and the modules used). I need this source to feed Beautiful Soup, so it can scrape the page and feed my ML script. My first attempt was to use requests:

from bs4 import BeautifulSoup as BS
import requests
import time
import datetime

print('start!')
print(datetime.datetime.now())

# Target page: the GeneCards entry for COL1A1
page = "http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"

This is my target page. I usually make about 20 requests a day, so it's not like I want to vampirize the website, and since I need the summaries at the same moment, I wanted to automate the retrieval; the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also reasonable in that I respect some delay before loading another page. I tried passing myself off as a regular browser, since the site doesn't like robots (its robots.txt disallows /ProductRedirect and something with a number that I could not find on Google):

# Spoof a desktop Firefox User-Agent, since the site doesn't like robots
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page, headers=headers)
print(current_page)
print(current_page.content)
soup = BS(current_page.content, "lxml")
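For reference, this is roughly how I check what comes back (a minimal sanity check, nothing site-specific):

print(current_page.status_code)                  # this is where I see the 200
print(current_page.headers.get('Content-Type'))  # what the server claims to return
print(len(current_page.text))                    # far too small for a real page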

I always end up with no content, even though requests returns code 200 and I can load the page myself in Firefox. So I tried Selenium:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime

print('start!')
print(datetime.datetime.now())

browser = webdriver.Firefox()
current_page = browser.get(page)
time.sleep(10)

This works and loads the page. I added the delay to be sure not to spam the host and to let the page load fully. But then neither:

html = current_page.content

nor

html = current_page.page_source

nor

html = current_page

works as an input for:

soup=BS(html,"lxml")

It always ends up saying that it doesn't have the page_source attribute (although it should, since the page loads correctly in the Selenium-invoked browser window).

I don't know what to try next. It's as if the User-Agent header weren't working for requests, and it is very strange that the page returned by Selenium has no source.

What could I try next? Thanks.

Note that I also tried:

browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html = browser.page_source
soup = BS(html, "lxml")
for summary in soup.find('section', attrs={'id': '_summaries'}):
    print(summary)

but while it can get the source, it just fails at the BS stage with: AttributeError: 'NoneType' object has no attribute 'find'.
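If it helps, my next idea for a diagnostic (untried as of writing) is to check whether the lookup itself returns None:

section = soup.find('section', attrs={'id': '_summaries'})
print(section is None)  # True would mean the id was not found in the parsed HTML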

2 Comments

  • Have you tried the regular html parser? soup = BS(html, "html.parser") (Commented Mar 3, 2016 at 16:21)
  • I just did. I use lxml because they recommend it. Anyway, html.parser still gets "'NoneType' object has no attribute 'find'". I am trying new things from the last solution, which is able to print the source, but I don't get why BS still doesn't want to parse it once the robot thing seems to be passed... (Commented Mar 3, 2016 at 16:37)

2 Answers


The problem is that you are trying to iterate over the result of .find(). Instead you need .find_all():

for summary in soup.find_all('section', attrs={'id': '_summaries'}):
    print(summary)

Or, if there is a single element, don't use a loop:

summary = soup.find('section', attrs={'id': '_summaries'})
print(summary)
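To illustrate the difference on a toy document (made-up HTML, not the actual GeneCards page):

from bs4 import BeautifulSoup as BS

html = '<section id="_summaries"><p>first</p><p>second</p></section>'
soup = BS(html, "lxml")

# .find() returns a single Tag (or None); .find_all() returns a list of Tags
section = soup.find('section', attrs={'id': '_summaries'})
print(section.name)              # section

for p in section.find_all('p'):  # a Tag can be searched like the soup itself
    print(p.get_text())          # first, second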

5 Comments

Ok, it seems better indeed, thanks. It is not quite the original question, but apart from the syntax error, I tried to iterate over the find_all iterable because I need to get the section with this "_summaries" id, then scrape it again for its parts (between <p> tags that fill the document and are all over the place). Do you have a suggestion for that? Are nested BS objects good practice, or can I achieve that in one command?
@AndoJurai sure, that summary variable inside a loop is a BS Tag instance - you can search inside it as with a regular soup object: [p.get_text() for p in summary.find_all("p")] for instance. Hope that helps.
Thanks. actually soup.find_all("p") works (while getting some text that I don't want), but neither summary.find_all("p"), neither soup.summary.find_all("p") do. as the section I am interested in is <section id="_summaries" data-ga-label="Summaries" data-section="Summaries">; i also tried _summaries, summaries, Summaries, as an identifier, but all of these come back with this attribute error. using for a in soup.find_all(re.compile("Summ")) gets nothing while for a in soup.find_all(re.compile("section")) gets too many things. I can't really wrap my head around BS workings...
@AndoJurai could you please elaborate on that in a separate question, providing your current code and the HTML source of the page, and point out what problems you are experiencing? Thanks!
Yes, I am going to do that, it will be the best way. Thanks

You shouldn't have to convert the html to a string object.

Try:

html = browser.page_source
soup = BS(html, "lxml")

1 Comment

Yes, actually this works; the str thing was one of my tries. I still don't really understand why you can't assign browser.get(page) to an object and then ask it for the page_source; for me it's puzzling. I'm really not familiar with this kind of object management; it's different from using constructors and so on.
