
I am attempting to scrape info from the following website: https://www.axial.net/forum/companies/united-states-family-offices/

I am trying to scrape the description for each family office, so the pages I need to scrape are "https://www.axial.net/forum/companies/united-states-family-offices/" + insert_company_name.

So I wrote the following code to test the program for just one page:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

# Point this at your local chromedriver binary
driver = webdriver.Chrome('insert_path_here/chromedriver')
driver.get("https://network.axial.net/company/ansaco-llp")
page_source = driver.page_source
soup2 = soup(page_source, "html.parser")
print(soup2.findAll('axl-teaser-description')[0].text)

This works for the single page, as long as the description doesn't have a "show full description" drop down button. I will save that for another question.
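For what it's worth, here is a rough sketch of the shape that fix might take, assuming the button can be reached by a CSS selector. "button.show-more" is a placeholder guess, not the site's actual markup:

from selenium.common.exceptions import NoSuchElementException

try:
    # Placeholder selector: expand the description before reading it
    driver.find_element_by_css_selector("button.show-more").click()
    page_source = driver.page_source  # re-read the page after expanding
except NoSuchElementException:
    pass  # no expand button on this page, nothing to do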

I wrote the following loop:

#Note: lst2 has all the names for the companies. I made sure they match the webpage
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/"+key.lower())
    page_source = driver.page_source

    for handle in driver.window_handles:
        driver.switch_to.window(handle)
    word_soup = soup(page_source, "html.parser")

    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)

When I run the loop, all of the values come out as "null", even the ones without "click for full description" buttons.

I edited the loop to print out word_soup instead, and the page source is different from what I get when I run the same code outside the loop; it does not contain the description text.

I don't understand why a loop would cause that, but apparently it does. Does anyone know how to fix this problem?

  • Your first example for ansaco-llp does not work for me. It does not find the axl-teaser-description element; page_source does not contain that element if you print it and check. Commented Apr 16, 2020 at 23:47
  • @Sri Not sure why it doesn't work for you, but I found the solution, which I will post in the next comment. Commented Apr 17, 2020 at 0:13

2 Answers


Found the solution: pause the program for 3 seconds after driver.get so the JavaScript that renders the description has time to run:

import time

lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/"+key.lower())
    time.sleep(3)  # give the page's JavaScript time to render the description
    page_source = driver.page_source

    word_soup = soup(page_source, "html.parser")

    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
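A fixed sleep happens to work here, but it wastes three seconds on fast pages and can still come up short on slow ones. A sketch of an explicit wait instead, assuming the element can be located by its tag name (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    # Block until the description element is in the DOM, up to 10 seconds
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "axl-teaser-description"))
    )
except TimeoutException:
    pass  # element never appeared; the loop will record 'null' as before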



I see that the page uses JavaScript to generate the text, meaning it doesn't show up in the page source, which is weird but ok. I don't quite understand why you're iterating through and switching to every window handle Selenium has open, but either way you definitely won't find the description in the page source / BeautifulSoup.

Honestly, I'd personally look for a better website if you can; otherwise, you'll have to try it with Selenium, which is inefficient and horrible.
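If you're stuck with Selenium anyway, one option is to skip page_source and BeautifulSoup entirely and read the rendered element straight from the driver, which sees the DOM after the JavaScript has run. A minimal sketch of that idea:

# The driver's view of the DOM includes JavaScript-rendered content
elements = driver.find_elements_by_tag_name("axl-teaser-description")
description = elements[0].text if elements else "null"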

2 Comments

The window_handles loop was unnecessary; I changed it in the solution.
Right, I forgot browsers need time to load a page, while requests have that built in and are virtually instant anyway.
