
I have a dataframe of params and apply a function to each row. This function is essentially a couple of SQL queries and simple calculations on the result.

I am trying to leverage Dask's multiprocessing while keeping the structure and roughly the same interface. The example below works and indeed gives a significant speedup:

def _metric(row):
    # Collect the descriptive fields for this area.
    record = {'areaName': row['name'],
              'areaType': row.area_type,
              'borough': row.Borough,
              'fullDate': row['start'],
              'yearMonth': row['start'],
              }

    # Build the SQL query from the row's parameters.
    Q = Qsi.format(unittypes=At,
                   start_date=row['start'],
                   end_date=row['end'],
                   freq='Q',
                   area_ids=row['descendent_ids'])

    # Run the query and derive the metrics from the result.
    sales = _get_DF(Q)
    record['salesInventory'] = len(sales)
    record['medianAskingPrice'] = sales.price.median()
    R.append(record)

R = []
x = ddf.map_partitions(lambda x: x.apply(_metric, axis=1), meta={'result': None})
x.compute()

result2 = pd.DataFrame(R)

However, when I try to use the .apply method instead (see below), it raises the error 'DataFrame' object has no attribute 'name':

R = list()
y = ddf.apply(_metric, axis=1, meta={'result': None})

Yet ddf.head() shows that there is a name column in the dataframe.

  • You write dask_DF.apply() but say that ddf has a name column. Try ddf.apply(). Commented Oct 13, 2017 at 20:04
  • Thanks, but that was just a misspelling (now resolved) that crept in while I simplified the code here. It has nothing to do with the issue. Commented Oct 13, 2017 at 20:06
  • The accepted answer also works for me. But the code sample in the question is too complex, and most of the code is not related to the problem. Commented Jun 22, 2021 at 15:34

1 Answer

If the output of your _metric function is a Series, try passing meta as a tuple: meta=('name of your series', 'dtype of the output').

This worked for me.


2 Comments

Could you please explain why using a tuple makes a difference here? This is not apparent from the documentation.
I'm sorry, I haven't used dask for nearly 2 years. I guess the meta parameter tells dask the name and dtype of the output you expect; if you don't set the dtype, dask may infer the wrong one.
