
I am using the `wikipedia` Python package to scrape data on a particular topic.

q=['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder', 'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

In the example above, I've searched for NASA. Now I need to obtain the summary for each element in the list.

import wikipedia

ny = []
for title in q:
    # Each iteration makes a blocking network request to Wikipedia.
    ny.append(wikipedia.page(title).summary)

Traversing each element of the list and retrieving its summary this way takes almost 40-60 seconds in total, even with a good network connection.

I don't know much about multiprocessing/multithreading. How can I speed up the execution by a considerable amount? Any help will be appreciated.

1 Answer

You can use a processing pool (see the multiprocessing.Pool documentation).

Here is an example based on your code:

from multiprocessing import Pool

import wikipedia


q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
     'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

def f(q_i):
    # Fetch the page and return just its summary (a picklable string).
    return wikipedia.page(q_i).summary

if __name__ == '__main__':
    # The guard is required on platforms that start workers by spawning
    # new interpreters (Windows, and macOS on recent Python versions).
    with Pool(5) as p:
        ny = p.map(f, q)

Basically, f is applied to each element of q in a separate process, and p.map returns the results in the original order. You set the number of worker processes when creating the pool (5 in my example).
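Since the slow part here is network I/O rather than CPU work, a thread pool is a lighter-weight alternative that avoids spawning processes and pickling arguments. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library; it reuses the q list from above, and fetch_summary is just an illustrative name:

from concurrent.futures import ThreadPoolExecutor

import wikipedia

def fetch_summary(title):
    # Threads share memory, so no pickling or __main__ guard is needed.
    return wikipedia.page(title).summary

with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map also preserves input order, so ny[i] matches q[i].
    ny = list(executor.map(fetch_summary, q))

While the threads wait on network responses they release the GIL, so for I/O-bound work like this you get the same concurrency benefit with less overhead.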
