UDF and python function in Pyspark

Question

I have a dataframe in Pyspark as:

listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
df = spark.createDataFrame(listA, ['id', 'name','country'])

and I have created a dictionary as:

thedict={"USA":"WASHINGTON","CHN":"BEIJING","DEFAULT":"KEY NOT FOUND"}

and Then I created a UDF to get the matching key values from dictionary.

def my_func(letter):
    if(thedict.get(letter) !=None):
        return thedict.get(letter)
    else:
        return thedict.get("DEFAULT")

I am getting below error when trying to call function as:

df.withColumn('CAPITAL',my_func(df.country))

  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1848, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

Whereas if I embedded it with pyspark.sql.functions, it's working fine.

from pyspark.sql.functions import col, udf
udfdict = udf(my_func,StringType())

df.withColumn('CAPITAL',udfdict(df.country)).show()

+---+----+-------+-------------+
| id|name|country|      CAPITAL|
+---+----+-------+-------------+
|  1| AAA|    USA|   WASHINGTON|
|  2| XXX|    CHN|      BEIJING|
|  3| KKK|    USA|   WASHINGTON|
|  4| PPP|    USA|   WASHINGTON|
|  5| EEE|    USA|   WASHINGTON|
|  5| HHH|    THA|KEY NOT FOUND|
+---+----+-------+-------------+

I couldn't understand what is the difference in these two calls?

Every function that you wrote and need to be applied on columns, you have to transform it to an pyspark UDF and then use it! — Ali AzG
– Ali AzG, Commented Dec 5, 2018 at 11:07
@Ali. That doesn't seems to be true. see below code. It's working fine. listA = [('A',10,20,40,60),('B',10,10,10,40)] df = spark.createDataFrame(listA, ['id', 'M1','M2','M3','M4']) def add_column(*args): num=0 for i in args: num = num +i return num newdf = df.withColumn('TOTAL', add_column(df.M1,df.M2,df.M3)) — Vikrant Singh Rana
– Vikrant Singh Rana, Commented Dec 5, 2018 at 11:21
no I have created a python udf as :def add_column(*args): num=0 for i in args: num = num +i return num — Vikrant Singh Rana
– Vikrant Singh Rana, Commented Dec 5, 2018 at 11:30
Yeah, I see that. have you tried any functions like add_column and without using udf apply it on columns? — Ali AzG
– Ali AzG, Commented Dec 5, 2018 at 11:34
I don't know exactly what is the problem, but I suggest you use udf to your work gets done! — Ali AzG
– Ali AzG, Commented Dec 5, 2018 at 11:45

Manoj Singh · Accepted Answer · 2018-12-05 15:06:06Z

2

UDF functions have special properties in that they take column/s and apply the logic row-wise to produce a new column. whereas a common python function takes only one discrete argument and produces a single output.

And thats what the error is about. The returned value from function is not a column

assert isinstance(col, Column), "col should be Column"

You can define udf in two ways:

myudf = udf(LAMBDA_EXPRESSION, RETURN_TYPE )
myudf = udf(CUSTOM_FUNCTION, RETURN_TYPE)

answered Dec 5, 2018 at 15:06

Manoj Singh

1,73714 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Vikrant Singh Rana Over a year ago

I get that and thanks for ur input. But why it's working in the example explained above in comment section. There also I am calling python function using withCoulmn and passing multiple arguments to it. Am I missing something ?

Collectives™ on Stack Overflow

UDF and python function in Pyspark

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related