I can't figure out how to use multithreading or multiprocessing in Python to speed up this scraping script, which collects the usernames of everyone who posted under the hashtag 'cats' on Instagram.
My goal is to make this as fast as possible, because the process is currently quite slow.

from instaloader import Instaloader

HASHTAG = 'cats'

loader = Instaloader(sleep=False)

users = []
for post in loader.get_hashtag_posts(HASHTAG):
    if post.owner_username not in users:
        users.append(post.owner_username) 
    print(post.owner_username)
  • Make users a set() instead of a list. For multithreading, maybe divide the posts list into 4 parts and execute the loop for them in 4 threads separately. Merge the users set from each of them at the end. Commented Feb 9, 2020 at 12:48
  • @AnmolSinghJaggi Can you show me? Commented Feb 9, 2020 at 15:47
  • Will do soon... Commented Feb 11, 2020 at 16:02
  • Well, I tried to run the program and noticed that it's slow because `loader.get_hashtag_posts(HASHTAG)` is a generator that returns the posts at a really slow rate. So this is a problem with the instaloader library itself and we can't do much about it. But just for completeness, I might write an answer to show how to use multithreading when I get more time. Commented Feb 13, 2020 at 17:12
  • Update: I actually implemented multithreading and it does seem to be significantly faster somehow. Have posted as an answer. All the best! Commented Feb 13, 2020 at 17:30
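
A minimal sketch of the suggestion in the first comment: collect usernames in a set rather than a list, and merge the per-thread sets at the end. The function names and sample data are hypothetical and not part of the original script.

```python
def unique_usernames(usernames):
    """Collect usernames into a set, which drops duplicates automatically."""
    seen = set()
    for name in usernames:
        # Adding to a set is O(1) on average, unlike `name not in some_list`.
        seen.add(name)
    return seen


def merge_user_sets(parts):
    """Combine the sets produced by several threads into one set."""
    return set().union(*parts)


# Hypothetical sample data standing in for post.owner_username values.
part_a = unique_usernames(["alice", "bob", "alice"])
part_b = unique_usernames(["bob", "carol"])
print(sorted(merge_user_sets([part_a, part_b])))
```

With a list, every `not in` check scans the whole list, so the cost grows with the number of users already collected; a set avoids that.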

2 Answers

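The LockedIterator is inspired by the one here. It guards the generator with a lock so that several threads can safely pull posts from the same iterator.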

import threading
from instaloader import Instaloader


class LockedIterator:
    """Wrap an iterator so that only one thread at a time can advance it."""

    def __init__(self, it):
        self.lock = threading.Lock()
        self.it = iter(it)

    def __iter__(self):
        return self

    def __next__(self):
        # The lock is released even when StopIteration is raised.
        with self.lock:
            return next(self.it)


HASHTAG = 'cats'
posts = Instaloader(sleep=False).get_hashtag_posts(HASHTAG)
posts = LockedIterator(posts)
users = set()


def worker():
    try:
        for post in posts:
            print(post.owner_username)
            users.add(post.owner_username)
    except Exception as e:
        print(e)
        raise


threads = []


for i in range(4):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

for t in threads:
    t.join()
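
To try the pattern above without hitting Instagram, here is a small self-contained sketch. It assumes a hypothetical `fake_posts` generator and a `slow_lookup` function that stand in for the real posts and for any per-post I/O; neither is part of instaloader. Only the call that advances the iterator is serialized by the lock, so work done on each item afterwards can overlap across threads.

```python
import threading
import time


class LockedIterator:
    """Wrap an iterator so only one thread at a time can advance it."""

    def __init__(self, it):
        self.lock = threading.Lock()
        self.it = iter(it)

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)


def fake_posts(n):
    """Hypothetical stand-in for get_hashtag_posts(): yields fake usernames."""
    for i in range(n):
        yield f"user{i % 5}"


def slow_lookup(name):
    """Simulates per-item I/O that happens outside the iterator lock."""
    time.sleep(0.01)
    return name


def collect(posts, n_threads=4):
    """Drain `posts` with several threads and return the unique names."""
    shared = LockedIterator(posts)
    users = set()
    users_lock = threading.Lock()

    def worker():
        for item in shared:
            name = slow_lookup(item)
            # Guard the shared set explicitly rather than relying on the GIL.
            with users_lock:
                users.add(name)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return users


print(sorted(collect(fake_posts(20))))
```

Whether the real scraper speeds up by a similar factor depends on how much of the time is spent inside the generator itself, which this sketch does not model.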
1 Comment

Thank you so much, this works like a charm. I'm now facing an output problem, though. I changed the script a bit to read hashtags from an input file, and it should create a separate output file with all the usernames for each hashtag, but my script writes everything into the first .txt file instead of one file per hashtag. I'll post the code in this thread.

The goal is to have an input file and separate output .txt files, one per hashtag. Maybe you can help me here too.

The problem should be somewhere around line 45.

I'm not very advanced, so my attempt may contain some wrong code.

As example hashtags in input.txt I used wqddt and d2deltas.

from instaloader import Instaloader
import threading
import io
import time
import sys

class LockedIterator(object):
    def __init__(self, it):
        self.lock = threading.Lock()
        self.it = it.__iter__()

    def __iter__(self):
        return self

    def __next__(self):
        self.lock.acquire()
        try:
            return self.it.__next__()
        finally:
            self.lock.release()

f = open('input.txt','r',encoding='utf-8')
HASHTAG = f.read()
p = HASHTAG.split('\n')
PROFILE = p[:]

for ind in range(len(PROFILE)):
    pro = PROFILE[ind]

posts = Instaloader(sleep=False).get_hashtag_posts(pro)
posts = LockedIterator(posts)
users = set()

start_time = time.time()

PROFILE = p[:]

def worker():
    for ind in range(len(PROFILE)):
        pro = PROFILE[ind]
        try:
            filename = 'downloads/'+pro+'.txt'
            fil = open(filename,'a',newline='',encoding="utf-8")

            for post in posts:
                   hashtags = post.owner_username
                   fil.write(str(hashtags)+'\n')

        except:
            print('Skipping',pro)


threads = []

for i in range(4): #Input Threads
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

for t in threads:
    t.join()

end_time = time.time()
print("Done")
print("Time taken : " + str(end_time - start_time) + "sec")    
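
A likely reason everything lands in one file, judging from the script above: `posts` is built once, outside `worker`, so all four threads drain that single shared iterator while the first output file is open, and nothing is left when the loop reaches the next hashtag. The sketch below creates a fresh iterator per hashtag instead. It is one possible restructuring, not a tested fix: `write_usernames`, `read_hashtags`, and the `fetch_posts` parameter are hypothetical names, and it deduplicates usernames with a set, which the script above does not do.

```python
import os
import threading


class LockedIterator:
    """Let several threads pull items from one iterator safely."""

    def __init__(self, it):
        self.lock = threading.Lock()
        self.it = iter(it)

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)


def read_hashtags(path):
    """Return the non-empty lines of the input file, one hashtag per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def instaloader_posts(hashtag):
    """Same Instaloader calls as the script above, imported lazily."""
    from instaloader import Instaloader
    return Instaloader(sleep=False).get_hashtag_posts(hashtag)


def worker(posts, users, users_lock):
    """Pull posts from the shared iterator and record each owner."""
    for post in posts:
        name = post.owner_username
        with users_lock:
            users.add(name)


def write_usernames(hashtags, fetch_posts, out_dir="downloads", n_threads=4):
    """Write one <hashtag>.txt file per hashtag into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for tag in hashtags:
        # A new iterator and a new set for every hashtag.
        posts = LockedIterator(fetch_posts(tag))
        users = set()
        users_lock = threading.Lock()
        threads = [
            threading.Thread(target=worker, args=(posts, users, users_lock))
            for _ in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Written after all threads finish, so no file handle is shared.
        out_path = os.path.join(out_dir, tag + ".txt")
        with open(out_path, "w", encoding="utf-8") as fil:
            fil.writelines(name + "\n" for name in sorted(users))
```

Under these assumptions the script would be driven with `write_usernames(read_hashtags('input.txt'), instaloader_posts)`. Skipping blank lines in `read_hashtags` also avoids the empty hashtag that `split('\n')` produces when the input file ends with a newline.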
