I am trying to decide if Spark or Dask gives us better performance for the work we are doing. I have a simple script that runs some operations on a DataFrame.
I am not convinced I am using the distributed version correctly, as the times are much slower than running Dask locally. Here is my script:
import time
from datetime import datetime

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client


def CreateTransactionFile(inputFile, client):
    startTime = time.time()
    # Read the CSV lazily as a Dask DataFrame
    df = dd.read_csv(inputFile)
    # One-row frame describing the output schema, passed as meta to map_partitions
    mock = pd.DataFrame([[1, 1, datetime.today(), 1, 1]], columns=['A', 'B', 'C', 'D', 'E'])
    # Apply the per-partition transformation, then materialise the result
    outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
    outDf.compute()
    print(str(time.time() - startTime))


if __name__ == '__main__':
    client = Client('10.184.62.61:8786')
    hdfs = 'hdfs://dir/python/test/bigger.csv'
    CreateTransactionFile(hdfs, client)
CreateTransactionFile_Partition operates on the provided partition using Pandas and NumPy and returns a DataFrame as a result.
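For context, the partition function has roughly this shape. This is only a sketch: the column operations shown here are placeholders, not the real logic, which I have left out for brevity.

    import numpy as np
    import pandas as pd

    def CreateTransactionFile_Partition(part):
        # 'part' arrives as a plain Pandas DataFrame for one partition.
        out = part.copy()
        # Illustrative work only; the real function does similar Pandas/NumPy operations.
        out['C'] = pd.to_datetime(out['C'], errors='coerce')
        out['E'] = np.where(out['D'] > 0, out['A'] * out['B'], 0)
        return out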
Should I be using something other than compute? The above code is more than twice as slow (550 s distributed vs. 230 s local) on a 700M-row CSV (~30 GB) compared with running on a single local machine. The local test reads a local file, while the multi-worker test reads from HDFS.
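For example, one alternative I was considering is writing the result out from the workers instead of pulling everything back to the client with compute(). This is just a sketch of that idea; the output path is made up, and it assumes the workers can write Parquet to HDFS (e.g. via pyarrow/fsspec):

    outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
    # Write from the workers rather than collecting the full result on the client
    outDf.to_parquet('hdfs://dir/python/test/transactions_out')  # or outDf.to_csv('...-*.csv')

Is that the intended pattern for this kind of workload, or is compute() itself fine and the slowdown is coming from somewhere else?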