
So I wrote a script to scrape a table on a website - NFL rosters for 32 teams, over 4 years. The website, however, only shows one team and one year at a time. So my script opens a team's page, selects a year, scrapes the data, then moves on to the next year, and so on until all four years of data are gathered. It then repeats the process for the remaining teams.

Now, I'm new to web scraping, so I'm not sure what I'm doing is computationally the best way to go about it. Currently, scraping one year of data for one team takes roughly 40-50s, so roughly 4 minutes per team. Scraping all the years for all the teams comes to over two hours.

Is there a way to scrape the same data with a shorter runtime?

Code is below:

import time

import pandas as pd
from selenium import webdriver

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']

# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]

# Changes the year parameter on a given page
def next_year(driver, year_idx):
    
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)

# Create scraping function
def sel_scrape(driver, team, year):
    
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    
    # Scrape header and rows (the first <tr> is the header)
    all_rows = main_table.find_elements_by_xpath(".//tr")
    header = [th.text.strip() for th in all_rows[0].find_elements_by_xpath(".//th")]
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in all_rows[1:]]
    
    # Compile into a dataframe
    df = pd.DataFrame(rows, columns=header)
    
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ',1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-',' ').title()
    df['Year'] = year
    
    return df

url='https://www.lineups.com/nfl/roster/'

df = pd.DataFrame()
years = [2020,2019,2018,2017]

start_time = time.time()

for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url+team)
    
    # For each of the four years
    for idx, year in enumerate(years):
        print("Starting {} {}".format(team, year))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, year)])
        
        # Change to the next year
        next_year(driver, idx)
    driver.close()

print("--- %s seconds ---" % (time.time() - start_time))
    
df.head()

1 Answer
You can improve this by not using Selenium. Selenium works, but driving a browser is inherently slow. The fastest way to get the data is to hit the API the page itself uses to render that table:

import pandas as pd
import requests
import time

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']


rows = []
start_time = time.time()
for team in team_ls:
    for season in range(2017,2021):
        print('Season: %s\tTeam: %s' % (season, team))
        teamStr = '-'.join(team.split()).lower()
        url= 'https://api.lineups.com/nfl/fetch/roster/{season}/{teamStr}'.format(season=season, teamStr=teamStr)

        jsonData = requests.get(url).json()
        roster = jsonData['data']
        for item in roster:
            item.update( {'Year':season, 'Team':team})
        rows += roster
        
df = pd.DataFrame(rows)

print("--- %s seconds ---" % (time.time() - start_time))

print(df.head())
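If runtime still matters, the remaining cost is mostly network latency, which can be hidden by issuing the roster requests concurrently. Below is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; the helper names (`team_slug`, `fetch_roster`, `fetch_all`) and the `max_workers=8` value are illustrative, not part of the original answer, and the API may throttle concurrent requests:

```python
import concurrent.futures

import pandas as pd
import requests

API = 'https://api.lineups.com/nfl/fetch/roster/{season}/{slug}'

def team_slug(team):
    # 'Green Bay Packers' -> 'green-bay-packers'
    return '-'.join(team.split()).lower()

def fetch_roster(team, season):
    # Fetch one team/season roster and tag each row with its team and year
    data = requests.get(API.format(season=season, slug=team_slug(team))).json()['data']
    for item in data:
        item.update({'Year': season, 'Team': team})
    return data

def fetch_all(teams, seasons, max_workers=8):
    # Submit every team/season pair to a thread pool and collect the rows
    rows = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_roster, t, s)
                   for t in teams for s in seasons]
        for future in concurrent.futures.as_completed(futures):
            rows += future.result()
    return pd.DataFrame(rows)
```

Since `as_completed` yields results in completion order, the final dataframe's row order is not deterministic; sort by `Team` and `Year` afterwards if that matters.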



2 Comments

Thanks! Where did you get that API URL?
Open Dev Tools (right-click → Inspect; you may need to reload the page), then go to the Network -> XHR tab and check the Headers of the requests the page makes. I'll add a pic in the solution above.
