
So I'm pulling NFL player statistics. The table shows a maximum of 50 rows, so I have to filter it down to make sure I don't miss any stats, which means iterating through the pages to collect all the data by season, by position, by team, and by week.

I figured out how the URL changes to cycle through these, but the iteration takes a long time. Since we can open multiple webpages at once, couldn't I run these requests in parallel, with each process collecting the data from its own page and storing it in its own temp_df, then merging them all at the end, instead of collecting one URL, merging, collecting the next URL, merging again, and so on? Without iterating through the positions this loop runs 6,144 times; with the positions it's over 36,000 iterations.

But I'm stuck on how to implement it, or if it's even possible.

Here's the code I'm currently using. I eliminated the cycle through positions to give an idea of how it works; for quarterbacks, p = 2.

So it starts at season 2005 = 1, team 1 = 1, week 1 = 0, then iterates through all of those to the last season 2016 = 12, team 32 = 33, and week 16 = 17:

import requests
import pandas as pd

# Index values used in the site's query string: sn (season), t (team), w/ew (week)
seasons = list(range(1, 13))
teams = list(range(1, 33))
weeks = list(range(0, 17))

qb_df = pd.DataFrame()

p = 2  # position code for quarterbacks
for s in seasons:
    for t in teams:
        for w in weeks:
            url = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
                   '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s'
                   '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
                   '&live=false&pid=true&minsnaps=4' % (s, w, w, t, p))
            html = requests.get(url).content
            df_list = pd.read_html(html)  # parse every table on the page
            temp_df = df_list[-1]         # the stats table is the last one
            temp_df['NFL Season'] = str(2017 - s)
            qb_df = qb_df.append(temp_df, ignore_index=True)

file = 'player_data_fanduel_2005_to_2016_qb.xls'
qb_df.to_excel(file)
print('\nData has been saved.')
  • I would suggest you use Scrapy (see the Scrapy documentation). Commented Sep 11, 2017 at 10:59
  • I agree with @AnuragMisra. Using Scrapy you can create multiple spiders and pipelines to do what you need (see the sketch just after these comments). Commented Sep 11, 2017 at 11:01
  • Is Scrapy pretty easy to pick up? I'm new to Python and brand new to web scraping (just started playing with it a few days ago). The other library I've seen talked about a lot is BeautifulSoup. Commented Sep 11, 2017 at 11:24
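As a rough illustration of the Scrapy suggestion in these comments, a spider might look something like the sketch below. It reuses the question's URL template with p=2 hard-coded for quarterbacks and leans on pandas to parse each downloaded table; the spider name qb_stats is an arbitrary choice, not anything from the comments.

import itertools

import pandas as pd
import scrapy

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

class QBStatsSpider(scrapy.Spider):
    name = 'qb_stats'
    # Scrapy schedules and downloads these URLs concurrently.
    start_urls = [URL % (s, w, w, t)
                  for s, t, w in itertools.product(range(1, 13),
                                                   range(1, 33),
                                                   range(0, 17))]

    def parse(self, response):
        # Reuse pandas to pull the last table out of each downloaded page.
        temp_df = pd.read_html(response.text)[-1]
        for row in temp_df.to_dict('records'):
            yield row

Running it with something like scrapy runspider qb_spider.py -o qb_stats.csv writes every yielded row to one file; recovering the season label (2017 - sn) from response.url is left out here to keep the sketch short.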

2 Answers


1/ Create a dict of seasons, teams, weeks, and URLs.

2/ Use a multiprocessing pool to call the URLs and collect the data, as sketched below.

Or use a dedicated scraping tool like Scrapy.
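A minimal sketch of those two steps, assuming the question's URL template and the same requests/pandas stack; the worker function fetch_one and the pool size of 8 are my own choices rather than anything prescribed here:

import itertools
import multiprocessing

import pandas as pd
import requests

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

def fetch_one(combo):
    """Download one (season, team, week) page and return its stats table."""
    s, t, w = combo
    html = requests.get(URL % (s, w, w, t)).content
    temp_df = pd.read_html(html)[-1]
    temp_df['NFL Season'] = str(2017 - s)
    return temp_df

if __name__ == '__main__':
    # 1/ every (season, team, week) combination, built up front
    combos = list(itertools.product(range(1, 13), range(1, 33), range(0, 17)))

    # 2/ a pool of worker processes fetches the pages in parallel
    with multiprocessing.Pool(processes=8) as pool:
        frames = pool.map(fetch_one, combos)

    qb_df = pd.concat(frames, ignore_index=True)
    qb_df.to_excel('player_data_fanduel_2005_to_2016_qb.xls')

Concatenating the per-page frames once at the end is also cheaper than appending to qb_df inside the loop, and the if __name__ == '__main__' guard lets the worker processes import the module safely on Windows.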




First you have to keep in mind that some servers will recognize an unusually heavy load from one IP and block your access (there are internet appliances that do this automatically), so you probably don't want to issue hundreds of requests in parallel.

If you don't use something like Scrapy, you don't need to resort to multithreading or multiprocessing; you'll probably be better off using asynchronous I/O. Python 3.5 supports async functions quite well, and they are very easy to work with.
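As a rough sketch of that suggestion, here is one way it could look with asyncio plus the third-party aiohttp client (an assumption on my part, since this answer only mentions async functions); a semaphore keeps just a handful of requests in flight at once, which also addresses the rate-limiting concern above:

import asyncio

import aiohttp
import pandas as pd

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

async def fetch_one(session, sem, s, t, w):
    """Fetch one page without blocking the event loop, then parse it."""
    async with sem:  # cap the number of simultaneous requests
        async with session.get(URL % (s, w, w, t)) as resp:
            html = await resp.text()
    temp_df = pd.read_html(html)[-1]
    temp_df['NFL Season'] = str(2017 - s)
    return temp_df

async def main():
    sem = asyncio.Semaphore(5)  # at most 5 requests in flight
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, s, t, w)
                 for s in range(1, 13)
                 for t in range(1, 33)
                 for w in range(0, 17)]
        frames = await asyncio.gather(*tasks)
    return pd.concat(frames, ignore_index=True)

qb_df = asyncio.run(main())  # Python 3.7+; on 3.5/3.6 use asyncio.get_event_loop().run_until_complete(main())
qb_df.to_excel('player_data_fanduel_2005_to_2016_qb.xls')

Lowering the semaphore to 2 would match the "two at a time" pace the questioner mentions in the comment below while still cutting the wall-clock time roughly in half.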

1 Comment

Oh, that's a good point. I wasn't even considering the load on the server. Ultimately I'm OK with toning it back; I don't need to send thousands of requests at once, but if I could even cut the time in half by doing two at a time, I'd be happy.
