I am trying to decide if Spark or Dask gives us better performance for the work we are doing. I have a simple script that runs some operations on a DataFrame.
I am not convinced I am using the distributed version correctly, as the times are much slower than running Dask locally. Here is my script:
import time
from datetime import datetime

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client


def CreateTransactionFile(inputFile, client):
    startTime = time.time()
    # Read the CSV lazily as a Dask DataFrame
    df = dd.read_csv(inputFile)
    # One-row frame describing the output schema, passed as meta to map_partitions
    mock = pd.DataFrame([[1, 1, datetime.today(), 1, 1]], columns=['A', 'B', 'C', 'D', 'E'])
    # Apply the per-partition transformation, then materialise the result
    outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
    outDf.compute()
    print(str(time.time() - startTime))


if __name__ == '__main__':
    client = Client('10.184.62.61:8786')
    hdfs = 'hdfs://dir/python/test/bigger.csv'
    CreateTransactionFile(hdfs, client)
CreateTransactionFile_Partition operates on the provided partition using Pandas and NumPy and returns a DataFrame as a result.
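For context, the partition function has roughly this shape. This is only a sketch: the column operations shown here are placeholders, not the real logic, which I have left out for brevity.

    import numpy as np
    import pandas as pd

    def CreateTransactionFile_Partition(part):
        # 'part' arrives as a plain Pandas DataFrame for one partition.
        out = part.copy()
        # Illustrative work only; the real function does similar Pandas/NumPy operations.
        out['C'] = pd.to_datetime(out['C'], errors='coerce')
        out['E'] = np.where(out['D'] > 0, out['A'] * out['B'], 0)
        return out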
Should I be using something other than compute? The above code is more than twice as slow (550 s distributed vs. 230 s local) on a 700M-row CSV (~30 GB) compared with running on a single local machine. The local test reads a local file, while the multi-worker test reads from HDFS.
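For example, one alternative I was considering is writing the result out from the workers instead of pulling everything back to the client with compute(). This is just a sketch of that idea; the output path is made up, and it assumes the workers can write Parquet to HDFS (e.g. via pyarrow/fsspec):

    outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
    # Write from the workers rather than collecting the full result on the client
    outDf.to_parquet('hdfs://dir/python/test/transactions_out')  # or outDf.to_csv('...-*.csv')

Is that the intended pattern for this kind of workload, or is compute() itself fine and the slowdown is coming from somewhere else?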