0

I would like to extract the market information from the following url and all of its subsequent pages:

https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1

I have successfully parsed the data that I want from the first page using some code from the following url:

https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages

I have also been able to parse out the url for the next page to feed into a loop in order to grab data from the next page. The problem is it crashes before the next page loads for a reason I don't fully understand.

I have a hunch that the class that I have borrowed from 'impythonist' may be causing the problem. I don't know enough object orientated programming to work out the problem. Here is my code, much of which is borrowed from the the url above:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html
import re
from bs4 import BeautifulSoup

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  



base_url='https://uk.reuters.com'
complete_next_page='https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

#LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print ('NEXT PAGE: ',complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()     # ERROR IS THROWN HERE ON 2nd PAGE

# PARSE THE HTML
soup = BeautifulSoup(result, 'lxml')
row_data=soup.find('div', attrs={'class':'column1 gridPanel grid8'})
print (len(row_data))

# PARSE ALL ROW DATA
stripe_rows=row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows=row_data.findAll('tr', attrs={'class':''})
print (len(stripe_rows))
print (len(non_stripe_rows))

# PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
#non_stripe_rows: from 4 to 18 (inclusive) contain data
#stripe_rows: from 2 to 16 (inclusive) contain data
i=2
while i < len(stripe_rows):
    print('CURRENT LINE IS: ',str(i))
    print(stripe_rows[i])
    print('###############################################')
    print(non_stripe_rows[i+2])
    print('\n')
    i+=1

#GETS LINK TO NEXT PAGE
next_page=str(soup.find('div', attrs={'class':'pageNavigation'}).find('li', attrs={'class':'next'}).find('a')['href']) #GETS LINK TO NEXT PAGE WORKS
complete_next_page=base_url+next_page

I have annotated the bits of code that I have written and understand but I don't really know what's going on in the 'Render' class enough to diagnose the error? Unless its something else?

Here is the error:

result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'

I don't need to keep the information in the class once I have parsed it out so I was thinking perhaps it could be cleared or reset somehow and then updated to hold the new url information from page 2:n but I have no idea how to do this?

Alternatively if anyone knows another way to grab this specific data from this page and the following ones then that would be equally helpful?

Many thanks in advance.

6
  • Try currentFrame() instead of frame Commented Nov 26, 2017 at 11:36
  • thanks for the advise, where would I put that? In the class definition or in the loop statement? or both? Commented Nov 26, 2017 at 11:53
  • there: change result = r.frame.toHtml() to result = r.currentFrame().toHtml() I can't tell if this is the answer to your question, therefore I haven't posted it as possible solution. Commented Nov 26, 2017 at 11:55
  • Its seeming to get stuck at the line 'result = r.currentFrame().toHtml()' any idea why it may do that? Commented Nov 26, 2017 at 12:00
  • Maybe because my answer is wrong... Commented Nov 26, 2017 at 12:04

1 Answer 1

1

How about using selenium and phantomjs instead of PyQt.
You can easily get selenium by executing "pip install selenium". If you use Mac you can get phantomjs by executing "brew install phantomjs". If your PC is Windows use choco instead of brew, or Ubuntu use apt-get.

from selenium import webdriver
from bs4 import BeautifulSoup

base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"

browser = webdriver.PhantomJS()

# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})

# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))

# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
  next_page = next_button.find("a")["href"]
  browser.get(base_url + next_page)
  soup = BeautifulSoup(browser.page_source, "lxml")
  row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
  stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
  non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
  print(len(stripe_rows), len(non_stripe_rows))
  next_button = soup.find("li", attrs={"class":"next"})

# DONT FORGET THIS!!
browser.quit()

I know the code above is not efficient (too slow I feel), but I think that it will bring you the results you desire. In addition, if the web page you want to scrape does not use Javascript, even PhantomJS and selenium are unnecessary. You can use the requests module. However, since I wanted to show you the contrast with PyQt, I used PhantomJS and Selenium in this answer.

Sign up to request clarification or add additional context in comments.

1 Comment

This is a great solution! It does exactly what i wanted! Thank you very much! Just for anyone else who may be be using anaconda; I installed from the command from with 'conda install -c conda-forge selenium', then followed the instructions here 'stackoverflow.com/questions/37903536/…' by 'dung' to add to the path in order to get it to work.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.