
I'm starting to use @pandas_udf with PySpark, and while testing the examples from the documentation I hit an error that I'm not able to solve.

The code I'm running is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()

And the error I get is:

Py4JJavaError: An error occurred while calling o53.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 (TID 132, localhost, executor driver): java.lang.IllegalArgumentException: capacity < 0: (-1 < 0)

I'm using:

pyspark                   2.4.5
py4j                      0.10.7            
pyarrow                   0.15.1

1 Answer


This is a known issue when using PyArrow > 0.15 with Spark 2.4.x; see https://issues.apache.org/jira/browse/SPARK-29367 for the details and the fix.
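For reference, the workaround described there is either to downgrade PyArrow below 0.15 or to make PyArrow emit the legacy Arrow IPC format that Spark 2.4.x expects by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 for both the driver and the Python workers (e.g. in conf/spark-env.sh on a cluster). Here is a minimal sketch of the environment-variable route, assuming the variable can be set before the SparkSession (and its JVM) is created:

import os
from pyspark.sql import SparkSession

# Tell PyArrow >= 0.15 to write the legacy Arrow IPC format that
# Spark 2.4.x understands. It also has to reach the executor-side
# Python workers, hence the spark.executorEnv entry below; setting it
# before the JVM starts is usually enough when running locally.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())

# Alternatively, downgrade PyArrow instead:
#   pip install "pyarrow<0.15"

After that, the grouped-map example from the question should run unchanged.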
