4

I have a dataframe named 'new_emp_final_1'. When I try to derive a column 'difficulty' from cookTime and prepTime by calling the function difficulty through a udf, I get an error.

new_emp_final_1.dtypes is below -

[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'string'), ('prepTime', 'string'), ('description', 'string')]

Result of new_emp_final_1.schema is -

StructType(List(StructField(name,StringType,true),StructField(ingredients,StringType,true),StructField(url,StringType,true),StructField(image,StringType,true),StructField(cookTime,StringType,true),StructField(recipeYield,StringType,true),StructField(datePublished,StringType,true),StructField(prepTime,StringType,true),StructField(description,StringType,true)))

Code:

def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unknown"

    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        return "Unknown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20,False)

Error is -

File "/home/raghavcomp32915/mypycode.py", line 56, in <module>
    func_udf = udf(difficulty, IntegerType())
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 186, in wrapper
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 166, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 66, in _to_seq
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 54, in _to_java_column
TypeError: Invalid argument, not a string or column: <function difficulty at 0x7f707e9750c8> of type <type 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I am expecting a column named difficulty in the existing dataframe new_emp_final_1, with the values Hard, Medium, Easy or Unknown.

3 Answers

19

I ran into this issue with Python's sum because it conflicted with Spark's SQL sum, a real-life illustration of why this:

from pyspark.sql.functions import *

is bad.

The solution was either to restrict the import to the functions actually needed, or to import pyspark.sql.functions as a module and prefix each function with it.
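The shadowing itself can be reproduced without a Spark installation. The following is only a sketch: `fake_sql_functions` is a hypothetical stand-in module (not part of PySpark) that, like pyspark.sql.functions, exports its own `sum`. With PySpark the namespaced form would be `import pyspark.sql.functions as F`.

```python
import builtins
import sys
import types

# Hypothetical stand-in for a module that, like pyspark.sql.functions,
# exports a function named `sum`.
fake = types.ModuleType("fake_sql_functions")
fake.sum = lambda col: "SUM({})".format(col)
sys.modules["fake_sql_functions"] = fake

# 1) Wildcard import: the module's `sum` silently replaces the builtin
#    in whatever namespace the import runs in.
wildcard_ns = {}
exec("from fake_sql_functions import *", wildcard_ns)
shadowed_sum = wildcard_ns["sum"]

# 2) Namespaced import: the builtin stays intact and the module's
#    version has to be requested explicitly.
import fake_sql_functions as F

print(shadowed_sum is builtins.sum)       # False
print(shadowed_sum("cookTime"))           # SUM(cookTime)
print(sum([1, 2, 3]), F.sum("cookTime"))  # 6 SUM(cookTime)
```

In the wildcard namespace a call such as sum([1, 2, 3]) would no longer reach the builtin, which is the kind of conflict described above.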

5

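Looking at the difficulty udf, I noticed two things:

  • you are adding two strings in the udf (cookTime and prepTime are both string columns), so they need to be converted to int first
  • the udf should be declared with StringType(), because the function returns strings such as "Hard", not integers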

This example worked for me:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField
import pandas as pd

# Schema from the question, shown for reference only: createDataFrame below
# infers string columns from the pandas data (which omits datePublished).
schema = StructType([StructField("name", StringType(), True),
                 StructField('ingredients',StringType(),True),
                 StructField('url',StringType(),True),
                 StructField('image',StringType(),True),
                 StructField('cookTime',StringType(),True),
                 StructField('recipeYield',StringType(),True),
                 StructField('datePublished',StringType(),True),
                 StructField('prepTime',StringType(),True),
                 StructField('description',StringType(),True)])


data = {
    "name": ['meal1', 'meal2'],
    "ingredients": ['ingredient11, ingredient12','ingredient21, ingredient22'],
    "url": ['URL1', 'URL2'],
    "image": ['Image1', 'Image2'],
    "cookTime": ['60', '3601'],
    "recipeYield": ['recipeYield1', 'recipeYield2'],
    "prepTime": ['0','3000'],
    "description": ['desc1','desc2']
    }

# Assumes an active SparkSession named spark (as in the pyspark shell).
new_emp_final_1_pd = pd.DataFrame(data=data)
new_emp_final_1 = spark.createDataFrame(new_emp_final_1_pd)

def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unknown"

    # The columns are strings, so convert before adding.
    total_duration = int(cookTime) + int(prepTime)
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        # Totals of exactly 1800 or 3600 end up here.
        return "Unknown"

# The function returns strings, so the udf is declared with StringType().
func_udf = udf(difficulty, StringType())
new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty",
    func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20,False)
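
Because the udf wraps an ordinary Python function, the classification logic can be checked on its own before Spark is involved. The variant below is a sketch, not the code from the answer: the name `classify_difficulty` is hypothetical, and it makes two assumptions of its own, namely that missing or non-numeric values should map to "Unknown" and that totals of exactly 1800 and 3600 should count as "Medium".

```python
def classify_difficulty(cook_time, prep_time):
    """Classify a recipe by total duration; inputs are strings or None."""
    try:
        total = int(cook_time) + int(prep_time)
    except (TypeError, ValueError):
        # None raises TypeError; empty or non-numeric strings raise ValueError.
        return "Unknown"
    if total > 3600:
        return "Hard"
    if total >= 1800:
        # Assumption: the boundary values 1800 and 3600 are treated as Medium.
        return "Medium"
    return "Easy"


print(classify_difficulty("60", "0"))       # Easy
print(classify_difficulty("3601", "3000"))  # Hard
print(classify_difficulty("1800", "0"))     # Medium
print(classify_difficulty(None, "10"))      # Unknown
```

It can then be registered the same way as above, with udf(classify_difficulty, StringType()).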

2 Comments

Yes, both of the points above helped me. I was also using udf as a variable name in earlier steps that are not shown here, which was also causing the error.
Yep.. that was the reason
0

Have you tried passing cookTime and prepTime through lit (from pyspark.sql.functions), like this:

new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(lit(new_emp_final_1.cookTime), lit(new_emp_final_1.prepTime)))

4 Comments

I tried changing the function's name but it still gives the same error.
Try the edited answer if it helps?! @RaghavendraGupta
Tried with StringType() as well, but no luck.
Edited the answer
