I need some advice. Right now I do some computation with the pandas library. The program uses multiprocessing and df.apply. A simple example showing my idea is here:

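import multiprocessing
import pandas as pd


def f2(row, item):
    # do some computation with item and the row's values and return something
    return 'something'


def f1(item):
    d1 = {'col1': [4, 5, 6], 'col2': [7, 8, 9]}
    df = pd.DataFrame(d1)

    # axis=1 passes each row (not each column) to f2
    df['col3'] = df.apply(f2, axis=1, args=(item,))


if __name__ == '__main__':

    l1 = [1, 2, 3]

    # start one process per item, then wait for all of them to finish
    processes = []
    for item in l1:
        x = multiprocessing.Process(target=f1, args=(item,))
        x.start()
        processes.append(x)
    for x in processes:
        x.join()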

I have two PCs, which is why I am thinking about a "local cluster". How can I run this code using the dask distributed library?
What should I change in this code? Does dask distributed work with multiprocessing?

Will dask distributed be faster than working on a single machine?

It computes on a small df, ca. 25,000 rows.

  • Hi, is your example really working? (Sorry, I didn't test it.) It looks a bit odd to me: you'll end up with three different dataframes? With Dask, you don't have to worry about multiprocessing, it does that for you, but you might want to check the Dask documentation, which answers your questions. You'll be able to use multiprocessing, distributed, or threaded mode, and apply your function in parallel. Commented Aug 8 at 17:38
  • This piece of code is a draft. Actually I want to try it with dask distributed, but I do not know how to configure it, given PC1 and PC2 on a local network. Do you have an example? I will need an example for running a Python script: what should I do on PC1 and what on PC2? Right now I am reading the manuals, but they describe only a local cluster. Wha Commented Aug 16 at 16:45
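
A minimal sketch of what the comments describe, not a tested two-machine setup: the per-item work from the question is submitted to a dask.distributed Client instead of to multiprocessing.Process. The SCHEDULER_ADDRESS constant, the run helper, and the arithmetic inside f2 are assumptions made for illustration. With the address left as None the script starts a local in-process cluster, so it can be tried on one PC first.

```python
import pandas as pd
from dask.distributed import Client

# Hypothetical placeholder: set to something like "tcp://<PC1-address>:8786"
# to use a scheduler running on PC1. None starts local workers instead.
SCHEDULER_ADDRESS = None


def f2(row, item):
    # Placeholder computation (assumption): combine the row's values with item.
    return row["col1"] + row["col2"] + item


def f1(item):
    df = pd.DataFrame({"col1": [4, 5, 6], "col2": [7, 8, 9]})
    # axis=1 passes each row to f2
    df["col3"] = df.apply(f2, axis=1, args=(item,))
    return df  # return the result so it can be gathered by the caller


def run(items, address=SCHEDULER_ADDRESS):
    # Connect to an existing scheduler, or start a local cluster when no
    # address is given.
    client = Client(address) if address else Client(processes=False)
    try:
        futures = client.map(f1, items)  # one task per item
        return client.gather(futures)    # list of DataFrames, in input order
    finally:
        client.close()


if __name__ == "__main__":
    for df in run([1, 2, 3]):
        print(df)
```

For two machines, the usual approach is to start a scheduler on PC1 with `dask scheduler` and a worker on PC2 with `dask worker tcp://<PC1-address>:8786` (older releases name these commands `dask-scheduler` and `dask-worker`), then point SCHEDULER_ADDRESS at PC1. Both machines need matching versions of Python, dask, and pandas. Whether this beats a single machine for about 25,000 rows is not something the sketch can show, since task scheduling and data transfer over the network add overhead of their own.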
