
So I'm pulling NFL player statistics. The table shows a maximum of 50 rows, so I have to filter it down to make sure I don't miss any stats, which means iterating through the pages to collect all the data by season, by position, by team, and by week.

I figured out how the URL changes to cycle through these, but the iteration takes a long time. Since we can open multiple webpages at once, couldn't I run these requests in parallel, with each process collecting the data from its own page and storing it in its own temp_df, then merging them all at the end, instead of collecting one URL, merging, collecting the next URL, merging again, and so on? Without iterating through the positions this loop runs 6,144 times; with the positions it's over 36,000 iterations.

But I'm stuck on how to implement it, or if it's even possible.

Here's the code I'm currently using. I eliminated the cycle through positions to give an idea of how it works; for quarterbacks, p = 2.

So it starts at season 2005 = 1, team 1 = 1, week 1 = 0, then iterates through all of those to the last season 2016 = 12, team 32 = 33, and week 16 = 17:

import requests
import pandas as pd

# Index values used in the site's query string: sn (season), t (team), w/ew (week)
seasons = list(range(1, 13))
teams = list(range(1, 33))
weeks = list(range(0, 17))

qb_df = pd.DataFrame()

p = 2  # position code for quarterbacks
for s in seasons:
    for t in teams:
        for w in weeks:
            url = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
                   '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s'
                   '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
                   '&live=false&pid=true&minsnaps=4' % (s, w, w, t, p))
            html = requests.get(url).content
            df_list = pd.read_html(html)  # parse every table on the page
            temp_df = df_list[-1]         # the stats table is the last one
            temp_df['NFL Season'] = str(2017 - s)
            qb_df = qb_df.append(temp_df, ignore_index=True)

file = 'player_data_fanduel_2005_to_2016_qb.xls'
qb_df.to_excel(file)
print('\nData has been saved.')
  • I would suggest you use Scrapy (see the Scrapy documentation). Commented Sep 11, 2017 at 10:59
  • I agree with @AnuragMisra. Using Scrapy you can create multiple spiders and pipelines to do what you need (see the sketch just after these comments). Commented Sep 11, 2017 at 11:01
  • Is Scrapy pretty easy to pick up? I'm new to Python and brand new to web scraping (just started playing with it a few days ago). The other library I've seen talked about a lot is BeautifulSoup. Commented Sep 11, 2017 at 11:24
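As a rough illustration of the Scrapy suggestion in these comments, a spider might look something like the sketch below. It reuses the question's URL template with p=2 hard-coded for quarterbacks and leans on pandas to parse each downloaded table; the spider name qb_stats is an arbitrary choice, not anything from the comments.

import itertools

import pandas as pd
import scrapy

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

class QBStatsSpider(scrapy.Spider):
    name = 'qb_stats'
    # Scrapy schedules and downloads these URLs concurrently.
    start_urls = [URL % (s, w, w, t)
                  for s, t, w in itertools.product(range(1, 13),
                                                   range(1, 33),
                                                   range(0, 17))]

    def parse(self, response):
        # Reuse pandas to pull the last table out of each downloaded page.
        temp_df = pd.read_html(response.text)[-1]
        for row in temp_df.to_dict('records'):
            yield row

Running it with something like scrapy runspider qb_spider.py -o qb_stats.csv writes every yielded row to one file; recovering the season label (2017 - sn) from response.url is left out here to keep the sketch short.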

2 Answers


1/ Create a dict of seasons, teams, weeks, and URLs.

2/ Use a multiprocessing pool to call the URLs and collect the data, as sketched below.

Or use a dedicated scraping tool like Scrapy.
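A minimal sketch of those two steps, assuming the question's URL template and the same requests/pandas stack; the worker function fetch_one and the pool size of 8 are my own choices rather than anything prescribed here:

import itertools
import multiprocessing

import pandas as pd
import requests

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

def fetch_one(combo):
    """Download one (season, team, week) page and return its stats table."""
    s, t, w = combo
    html = requests.get(URL % (s, w, w, t)).content
    temp_df = pd.read_html(html)[-1]
    temp_df['NFL Season'] = str(2017 - s)
    return temp_df

if __name__ == '__main__':
    # 1/ every (season, team, week) combination, built up front
    combos = list(itertools.product(range(1, 13), range(1, 33), range(0, 17)))

    # 2/ a pool of worker processes fetches the pages in parallel
    with multiprocessing.Pool(processes=8) as pool:
        frames = pool.map(fetch_one, combos)

    qb_df = pd.concat(frames, ignore_index=True)
    qb_df.to_excel('player_data_fanduel_2005_to_2016_qb.xls')

Concatenating the per-page frames once at the end is also cheaper than appending to qb_df inside the loop, and the if __name__ == '__main__' guard lets the worker processes import the module safely on Windows.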




First you have to keep in mind that some servers will recognize an unusually heavy load from one IP and block your access (there are internet appliances that do this automatically), so you probably don't want to issue hundreds of requests in parallel.

If you don't use something like Scrapy, you don't need to resort to multithreading or multiprocessing; you'll probably be better off using asynchronous I/O. Python 3.5 supports async functions quite well, and they are very easy to work with.
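As a rough sketch of that suggestion, here is one way it could look with asyncio plus the third-party aiohttp client (an assumption on my part, since this answer only mentions async functions); a semaphore keeps just a handful of requests in flight at once, which also addresses the rate-limiting concern above:

import asyncio

import aiohttp
import pandas as pd

# Question's URL template with p=2 (quarterbacks) baked in.
URL = ('https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx'
       '?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=2'
       '&st=FantasyPointsFanDuel&d=1&ls=FantasyPointsFanDuel'
       '&live=false&pid=true&minsnaps=4')

async def fetch_one(session, sem, s, t, w):
    """Fetch one page without blocking the event loop, then parse it."""
    async with sem:  # cap the number of simultaneous requests
        async with session.get(URL % (s, w, w, t)) as resp:
            html = await resp.text()
    temp_df = pd.read_html(html)[-1]
    temp_df['NFL Season'] = str(2017 - s)
    return temp_df

async def main():
    sem = asyncio.Semaphore(5)  # at most 5 requests in flight
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, s, t, w)
                 for s in range(1, 13)
                 for t in range(1, 33)
                 for w in range(0, 17)]
        frames = await asyncio.gather(*tasks)
    return pd.concat(frames, ignore_index=True)

qb_df = asyncio.run(main())  # Python 3.7+; on 3.5/3.6 use asyncio.get_event_loop().run_until_complete(main())
qb_df.to_excel('player_data_fanduel_2005_to_2016_qb.xls')

Lowering the semaphore to 2 would match the "two at a time" pace the questioner mentions in the comment below while still cutting the wall-clock time roughly in half.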

1 Comment

Oh, that's a good point. I wasn't even considering the load on the server. Ultimately I'm OK with toning it back; I don't need to send thousands of requests at once, but if I could even cut the time in half by doing two at a time, I'd be happy.
