So I'm pulling statistics for NFL players. The table only shows a maximum of 50 rows, so I have to filter it down to make sure I don't miss any stats, which means iterating through the pages to collect all the data by season, by position, by team, and by week.
I figured out how the URL changes to cycle through these, but the iteration takes a long time, and it got me thinking: we can open multiple web pages at once, so couldn't I run these requests in parallel, where each process simultaneously collects the data from its own page and stores it in its own temp_df, and they all get merged at the end, instead of fetching one URL, merging, fetching the next URL, merging, and so on, one at a time? Without the position loop this means 6,144 iterations; with the positions, over 36,000. (There's a sketch of what I'm imagining after my current code below.)
But I'm stuck on how to implement it, or if it's even possible.
Here's the code I'm currently using. I removed the cycle through positions to give an idea of how it works; for quarterbacks, p = 2.
So it starts at season 2005 = 1, team 1 = 1, week 1 = 0, then iterates through all of those to the last season 2016 = 12, team 32 = 33, and week 16 = 17:
import requests
import pandas as pd

seasons = list(range(1, 13))   # sn parameter in the URL
teams = list(range(1, 33))     # t parameter
weeks = list(range(0, 17))     # w/ew parameters

qb_df = pd.DataFrame()
p = 2  # position filter: 2 = quarterbacks

for s in seasons:
    for t in teams:
        for w in weeks:
            url = 'https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel&live=false&pid=true&minsnaps=4' % (s, w, w, t, p)
            html = requests.get(url).content
            df_list = pd.read_html(html)  # parse every table on the page
            temp_df = df_list[-1]         # the stats table is the last one
            temp_df['NFL Season'] = str(2017 - s)
            qb_df = qb_df.append(temp_df, ignore_index=True)

file = 'player_data_fanduel_2005_to_2016_qb.xls'
qb_df.to_excel(file)
print('\nData has been saved.')
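Here's an untested sketch of the parallel version I have in mind, using ThreadPoolExecutor from the standard library's concurrent.futures. The fetch_page helper and the max_workers value are my own placeholders, not something I've verified against the site:

from concurrent.futures import ThreadPoolExecutor
import requests
import pandas as pd

URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

def fetch_page(args):
    # Fetch one page and return its stats table as a DataFrame
    s, t, w, p = args
    html = requests.get(URL % (s, w, w, t, p)).content
    temp_df = pd.read_html(html)[-1]
    temp_df['NFL Season'] = str(2017 - s)
    return temp_df

p = 2  # quarterbacks
combos = [(s, t, w, p)
          for s in range(1, 13)
          for t in range(1, 33)
          for w in range(0, 17)]

# Threads fit here because the work is I/O-bound (waiting on HTTP).
# max_workers=16 is a guess -- too many may get you rate-limited.
with ThreadPoolExecutor(max_workers=16) as executor:
    frames = list(executor.map(fetch_page, combos))

qb_df = pd.concat(frames, ignore_index=True)
qb_df.to_excel('player_data_fanduel_2005_to_2016_qb.xls')

The one merge at the end with pd.concat also avoids rebuilding the growing DataFrame on every iteration, which is part of what makes the serial version slow.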
You can create multiple spiders and pipelines in scrapy to do what you need; see the scrapy documentation.

Is scrapy pretty easy to pick up? I'm new to Python and brand new to web scraping (just started playing with it a few days ago). The other library I've seen talked about a lot is BeautifulSoup.
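A bare-bones sketch of what a scrapy spider for this case might look like (untested; the CSS selectors are placeholders you'd need to adapt to the actual page markup). Scrapy schedules the requests concurrently on its own, governed by the CONCURRENT_REQUESTS setting:

import scrapy

class QBStatsSpider(scrapy.Spider):
    name = 'qb_stats'
    # scrapy issues pending requests in parallel; this caps how many
    custom_settings = {'CONCURRENT_REQUESTS': 16}

    def start_requests(self):
        url = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
               '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
               '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
               '&live=false&pid=true&minsnaps=4')
        for s in range(1, 13):
            for t in range(1, 33):
                for w in range(0, 17):
                    yield scrapy.Request(url % (s, w, w, t),
                                         callback=self.parse,
                                         meta={'season': 2017 - s})

    def parse(self, response):
        season = response.meta['season']
        # 'table tr' and 'td::text' are placeholder selectors --
        # inspect the real page and adjust
        for row in response.css('table tr'):
            yield {'season': season,
                   'cells': row.css('td::text').getall()}

You would save that as, say, qb_spider.py (a hypothetical filename) and run it with scrapy runspider qb_spider.py -o qb_stats.csv, letting scrapy's exporter handle the merge-at-the-end step for you.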