I'm starting to use @pandas_udf in PySpark, and while testing with the examples from the documentation I run into an error that I'm not able to solve.
The code I'm running is:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())
df.groupby("id").apply(subtract_mean).show()
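For reference, if I understand the documentation example correctly, the grouped-map UDF should subtract each group's mean of v, so I would expect output along these lines (row order may differ) instead of the exception:

# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+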
And the error I get is:
Py4JJavaError: An error occurred while calling o53.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 (TID 132, localhost, executor driver): java.lang.IllegalArgumentException: capacity < 0: (-1 < 0)
I'm using:
pyspark 2.4.5
py4j 0.10.7
pyarrow 0.15.1