
I am trying to decide if Spark or Dask gives us better performance for the work we are doing. I have a simple script that runs some operations on a DataFrame.

I am not convinced I am using the distributed version correctly, as the times are much slower than running Dask locally. Here is my script:

 import time
 from datetime import datetime

 import dask.dataframe as dd
 import pandas as pd
 from dask.distributed import Client


 def CreateTransactionFile(inputFile, client):
     startTime = time.time()
     df = dd.read_csv(inputFile)

     mock = pd.DataFrame([[1,1, datetime.today(), 1,1]], columns=['A', 'B', 'C', 'D', 'E'])

     outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
     outDf.compute()
     print(str(time.time() - startTime))


 if __name__ == '__main__':
     client = Client('10.184.62.61:8786')
     hdfs = 'hdfs://dir/python/test/bigger.csv'
     CreateTransactionFile(hdfs, client)

CreateTransactionFile_Partition operates on the provided dataframe using Pandas and NumPy, and returns a dataframe as its result.
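The real CreateTransactionFile_Partition isn't shown in the question; a minimal stand-in, assuming it does ordinary pandas/NumPy work per partition and returns columns matching the `meta` mock (A, B, C, D, E), might look like this (the function name and column logic here are hypothetical):

```python
from datetime import datetime

import numpy as np
import pandas as pd


def create_transaction_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for CreateTransactionFile_Partition: plain
    # pandas/NumPy work on one partition, returning columns that match
    # the `meta` mock passed to map_partitions (A, B, C, D, E).
    return pd.DataFrame({
        'A': part.iloc[:, 0].to_numpy(),
        'B': np.arange(len(part)),
        'C': datetime.today(),               # scalar broadcasts to every row
        'D': part.iloc[:, 0].to_numpy() * 2,
        'E': 1,
    })
```

Whatever the real logic is, the returned frame must match `meta` in column names and dtypes, or Dask will raise at compute time.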

Should I be using something other than compute? The above code is more than twice as slow on the cluster (550 s vs 230 s) compared with running on a single local machine, for a 700M-row CSV (~30 GB). The local test reads a local file, whereas the multi-worker run reads from HDFS.

1 Answer


outDf.compute()

What's happening here: the workers load and process the partitions of the data, and then the results are copied to the client and assembled into a single, in-memory dataframe. This copying requires potentially expensive inter-process communication. That might be what you want, if the processing is aggregational and the output is small.

However, if the output is large, you want to do your processing on the workers by using the dataframe API without .compute(), perhaps writing the output to files with, e.g., .to_parquet().


2 Comments

I wonder if it would be faster for each partition result to be written to parquet, and for the driver to then load the results from there, instead of the current .compute. My desired result is a DataFrame (in memory) with some bespoke logic applied to it. It's not important how that is achieved.
By "in memory", you probably mean the memory of the workers, not the client. But sure, you can try writing the intermediates, and measure the difference. It will depend on your exact operations.
