
I am just a little confused about how to create a Spark UDF. I currently have a function parse_xml and do the following:

spark.udf.register("parse_xml_udf", parse_xml)
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

where xml_df is the original Spark DataFrame and raw_xml is the column I want to apply the function to.

I have seen in a few places a line like spark_udf = udf(parse_xml, StringType()) -- what is the difference between this and the spark.udf.register line? Additionally, if I apply the function to that one column, is it applied to each row? In other words, should my UDF return the output for a single row?

1 Answer

  • Use spark.udf.register("squaredWithPython", squared) when you want to call the function from SQL, like this: %sql select id, squaredWithPython(id) as id_squared from test

  • Use squared_udf = udf(squared, LongType()) when you only need it in the DataFrame API, like this: display(df.select("id", squared_udf("id").alias("id_squared")))

That's all, but these things are not always clearly explained in the manuals.


Comments

So if I want to use it like this: xml_df.withColumn('parsed_xml', parse_xml_udf(xml_df['raw_xml'])), I should create it with udf(__)?
Yes, withColumn will apply to all rows unless you filter them.
You can do both and see which works, and learn from that. udf() should be enough, but it depends on how you call it; Spark has all sorts of quirks.
Yep, from this distance.
Thanks for the help.
