So I wrote a script to scrape a table on a website - NFL rosters for 32 teams, over 4 years. The website, however, only shows one team and one year at a time. So my script opens a team's page, selects a year, scrapes the data, then moves on to the next year, and so on until all four years of data are gathered. It then repeats the process for each of the 32 teams.
Now, I'm new to web scraping, so I'm not sure that, computationally, what I'm doing is the best way to go about it. Currently, scraping one year of data for one team takes roughly 40-50 seconds, so roughly 4 minutes per team. Scraping all the years for all the teams comes to over two hours.
Is there a way to scrape the data and decrease runtime?
Code is below:
import time

import pandas as pd
from selenium import webdriver
# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']
# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]
# Changes the year parameter on a given page
def next_year(driver, year_idx):
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)
# Create scraping function
def sel_scrape(driver, team, year):
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    # Scrape header and rows (the first row holds the header)
    trs = main_table.find_elements_by_xpath(".//tr")
    header = [th.text.strip() for th in trs[0].find_elements_by_xpath(".//th")]
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in trs[1:]]
    # Compile in dataframe
    df = pd.DataFrame(rows, columns=header)
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ', 1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-', ' ').title()
    df['Year'] = year
    return df
url='https://www.lineups.com/nfl/roster/'
df = pd.DataFrame()
years = [2020,2019,2018,2017]
start_time = time.time()
for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url + team)
    # For each of the four years
    for idx in range(4):
        print("Starting {} {}".format(team, years[idx]))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, years[idx])])
        # Change to next year
        next_year(driver, idx)
    driver.close()
print("--- %s seconds ---" % (time.time() - start_time))
df.head()
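One likely bottleneck is that every `find_element` call in `sel_scrape` is a separate round trip to the browser, and the table produces hundreds of them per page. An alternative is to grab `driver.page_source` once and parse it in-process. A minimal sketch, assuming the roster is the first `<table>` on the page and its first row holds the header (names here are illustrative, not from the original script):

```python
import lxml.html as lh
import pandas as pd

def parse_roster(page_source):
    # Parse the whole page once instead of issuing one Selenium
    # round trip per cell.
    tree = lh.document_fromstring(page_source)
    table = tree.xpath('//table')[0]  # assumes the first table is the roster
    trs = table.xpath('.//tr')
    header = [th.text_content().strip() for th in trs[0].xpath('.//th')]
    rows = [[td.text_content().strip() for td in tr.xpath('.//td')]
            for tr in trs[1:]]
    return pd.DataFrame(rows, columns=header)

# Inside the existing loop this would replace the per-element scraping:
# df = pd.concat([df, parse_roster(driver.page_source)])
```

Selenium still drives the page and the year dropdown; only the extraction moves to lxml, which typically turns the per-page scrape from tens of seconds into well under a second.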

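The other lever is that each team's scrape is independent of the others, so several browsers could run at once and the fixed 3-second dropdown waits would overlap instead of adding up. A sketch of the fan-out, where `scrape_one` is assumed to be a function you'd write that opens its own driver, scrapes all four years for one team, and returns a DataFrame (the helper name is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def run_parallel(scrape_one, teams, max_workers=4):
    # Each worker handles one team end to end with its own driver,
    # so per-page waits run concurrently across teams.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(scrape_one, teams))
    return pd.concat(frames, ignore_index=True)

# Usage sketch:
# df = run_parallel(scrape_team, team_ls, max_workers=4)
```

Each Chrome instance is memory-hungry, so `max_workers` would need tuning to the machine; 4 workers would cut the wall-clock time roughly fourfold if the scrape is wait-bound.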