
So I wrote a script to scrape a table on a website - NFL rosters for 32 teams, over 4 years. The website, however, only shows one team and one year at a time. So my script opens a team's page, selects a year, scrapes the data, then moves on to the next year, and so on until all four years of data are gathered. It then repeats the process for the remaining teams.

Now, I'm new to web scraping, so I'm not sure what I'm doing is computationally the best way to go about it. Currently, scraping one year of data for one team takes roughly 40-50s, so roughly 4 minutes per team. Scraping all the years for all the teams comes to over two hours.

Is there a way to scrape the same data with a shorter runtime?

Code is below:

import time

import pandas as pd
from selenium import webdriver

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']

# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]

# Changes the year parameter on a given page
def next_year(driver, year_idx):
    
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)

# Create scraping function
def sel_scrape(driver, team, year):
    
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    
    # Scrape header and rows (the first <tr> is the header)
    all_rows = main_table.find_elements_by_xpath(".//tr")
    header = [th.text.strip() for th in all_rows[0].find_elements_by_xpath(".//th")]
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in all_rows[1:]]
    
    # Compile into a dataframe
    df = pd.DataFrame(rows, columns=header)
    
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ',1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-',' ').title()
    df['Year'] = year
    
    return df

url='https://www.lineups.com/nfl/roster/'

df = pd.DataFrame()
years = [2020,2019,2018,2017]

start_time = time.time()

for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url+team)
    
    # For each of the four years
    for idx, year in enumerate(years):
        print("Starting {} {}".format(team, year))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, year)])
        
        # Change to the next year
        next_year(driver, idx)
    driver.close()

print("--- %s seconds ---" % (time.time() - start_time))
    
df.head()

1 Answer
You can improve this by not using Selenium. Selenium works, but driving a browser is inherently slow. The fastest way to get the data is to hit the API the page itself uses to render that table:

import pandas as pd
import requests
import time

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']


rows = []
start_time = time.time()
for team in team_ls:
    for season in range(2017,2021):
        print('Season: %s\tTeam: %s' % (season, team))
        teamStr = '-'.join(team.split()).lower()
        url= 'https://api.lineups.com/nfl/fetch/roster/{season}/{teamStr}'.format(season=season, teamStr=teamStr)

        jsonData = requests.get(url).json()
        roster = jsonData['data']
        for item in roster:
            item.update( {'Year':season, 'Team':team})
        rows += roster
        
df = pd.DataFrame(rows)

print("--- %s seconds ---" % (time.time() - start_time))

print(df.head())
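If runtime still matters, the remaining cost is mostly network latency, which can be hidden by issuing the roster requests concurrently. Below is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; the helper names (`team_slug`, `fetch_roster`, `fetch_all`) and the `max_workers=8` value are illustrative, not part of the original answer, and the API may throttle concurrent requests:

```python
import concurrent.futures

import pandas as pd
import requests

API = 'https://api.lineups.com/nfl/fetch/roster/{season}/{slug}'

def team_slug(team):
    # 'Green Bay Packers' -> 'green-bay-packers'
    return '-'.join(team.split()).lower()

def fetch_roster(team, season):
    # Fetch one team/season roster and tag each row with its team and year
    data = requests.get(API.format(season=season, slug=team_slug(team))).json()['data']
    for item in data:
        item.update({'Year': season, 'Team': team})
    return data

def fetch_all(teams, seasons, max_workers=8):
    # Submit every team/season pair to a thread pool and collect the rows
    rows = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_roster, t, s)
                   for t in teams for s in seasons]
        for future in concurrent.futures.as_completed(futures):
            rows += future.result()
    return pd.DataFrame(rows)
```

Since `as_completed` yields results in completion order, the final dataframe's row order is not deterministic; sort by `Team` and `Year` afterwards if that matters.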



2 Comments

Thanks! Where did you get that API URL?
Open Dev Tools (right-click → Inspect; you may need to reload the page), then go to the Network -> XHR tab and check the Headers of the requests the page makes. I'll add a pic in the solution above.
