2

I have a dataframe in Pyspark as:

listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
df = spark.createDataFrame(listA, ['id', 'name','country'])

and I have created a dictionary as:

thedict={"USA":"WASHINGTON","CHN":"BEIJING","DEFAULT":"KEY NOT FOUND"}

and Then I created a UDF to get the matching key values from dictionary.

def my_func(letter):
    if(thedict.get(letter) !=None):
        return thedict.get(letter)
    else:
        return thedict.get("DEFAULT")

I am getting below error when trying to call function as:

df.withColumn('CAPITAL',my_func(df.country))

  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1848, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

Whereas if I embedded it with pyspark.sql.functions, it's working fine.

from pyspark.sql.functions import col, udf
udfdict = udf(my_func,StringType())

df.withColumn('CAPITAL',udfdict(df.country)).show()

+---+----+-------+-------------+
| id|name|country|      CAPITAL|
+---+----+-------+-------------+
|  1| AAA|    USA|   WASHINGTON|
|  2| XXX|    CHN|      BEIJING|
|  3| KKK|    USA|   WASHINGTON|
|  4| PPP|    USA|   WASHINGTON|
|  5| EEE|    USA|   WASHINGTON|
|  5| HHH|    THA|KEY NOT FOUND|
+---+----+-------+-------------+

I couldn't understand what is the difference in these two calls?

6
  • Every function that you wrote and need to be applied on columns, you have to transform it to an pyspark UDF and then use it! Commented Dec 5, 2018 at 11:07
  • @Ali. That doesn't seems to be true. see below code. It's working fine. listA = [('A',10,20,40,60),('B',10,10,10,40)] df = spark.createDataFrame(listA, ['id', 'M1','M2','M3','M4']) def add_column(*args): num=0 for i in args: num = num +i return num newdf = df.withColumn('TOTAL', add_column(df.M1,df.M2,df.M3)) Commented Dec 5, 2018 at 11:21
  • no I have created a python udf as :def add_column(*args): num=0 for i in args: num = num +i return num Commented Dec 5, 2018 at 11:30
  • Yeah, I see that. have you tried any functions like add_column and without using udf apply it on columns? Commented Dec 5, 2018 at 11:34
  • 1
    I don't know exactly what is the problem, but I suggest you use udf to your work gets done! Commented Dec 5, 2018 at 11:45

1 Answer 1

2

UDF functions have special properties in that they take column/s and apply the logic row-wise to produce a new column. whereas a common python function takes only one discrete argument and produces a single output.

And thats what the error is about. The returned value from function is not a column

assert isinstance(col, Column), "col should be Column"

You can define udf in two ways:

  1. myudf = udf(LAMBDA_EXPRESSION, RETURN_TYPE )
  2. myudf = udf(CUSTOM_FUNCTION, RETURN_TYPE)
Sign up to request clarification or add additional context in comments.

1 Comment

I get that and thanks for ur input. But why it's working in the example explained above in comment section. There also I am calling python function using withCoulmn and passing multiple arguments to it. Am I missing something ?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.