I'm trying to run a simple PySpark script, but it fails with the error below.

Versions:

  • OS: Windows 11
  • Apache Spark: 4.0.1
  • Java: 17.0.12 (2024-07-16 LTS)
  • Python: 3.11.9
  • Hadoop: 3.3.6

Code:

from pyspark.sql import SparkSession
import os

# Workarounds for common Windows issues: silence the PyArrow timezone
# warning and force Spark to bind to localhost.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"
os.environ["SPARK_LOCAL_HOSTNAME"] = "localhost"

# Build a local session and show a tiny two-row DataFrame.
spark = SparkSession.builder.appName("PySpark Test").getOrCreate()
data = [("Alice", 1), ("Bob", 2)]
df = spark.createDataFrame(data, ["Name", "ID"])
df.show()

Error text:

(.venv) PS D:\coding\source\test_tasks\test_tasks\products_spark> python -m src.test
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception ignored in: <_io.BufferedRWPair object at 0x00000186DF5A5880>+ 1) / 1]
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.2032.0_x64__qbz5n2kfra8p0\Lib\socket.py", line 737, in write
OSError: [WinError 10038] An operation was attempted on something that is not a socket
25/09/28 18:29:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:945)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:925)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
        at org.apache.spark.scheduler.Task.run(Task.scala:147)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
        at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:756)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:933)
        ... 26 more
25/09/28 18:29:03 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (localhost executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:945)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:925)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
        at org.apache.spark.scheduler.Task.run(Task.scala:147)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
        at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:756)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:933)
        ... 26 more

25/09/28 18:29:03 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\coding\source\test_tasks\test_tasks\products_spark\src\test.py", line 14, in <module>
    df.show()
  File "D:\coding\source\test_tasks\test_tasks\products_spark\.venv\Lib\site-packages\pyspark\sql\classic\dataframe.py", line 285, in show
    print(self._show_string(n, truncate, vertical))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\coding\source\test_tasks\test_tasks\products_spark\.venv\Lib\site-packages\pyspark\sql\classic\dataframe.py", line 303, in _show_string
    return self._jdf.showString(n, 20, vertical)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\coding\source\test_tasks\test_tasks\products_spark\.venv\Lib\site-packages\py4j\java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "D:\coding\source\test_tasks\test_tasks\products_spark\.venv\Lib\site-packages\pyspark\errors\exceptions\captured.py", line 282, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "D:\coding\source\test_tasks\test_tasks\products_spark\.venv\Lib\site-packages\py4j\protocol.py", line 327, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o48.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (localhost executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:945)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:925)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
        at org.apache.spark.scheduler.Task.run(Task.scala:147)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
        at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:756)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:933)
        ... 26 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$3(DAGScheduler.scala:2935)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2935)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2927)
        at scala.collection.immutable.List.foreach(List.scala:334)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2927)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1295)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1295)
        at scala.Option.foreach(Option.scala:437)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1295)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3207)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3141)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3130)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:50)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2484)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2505)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2524)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:544)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:497)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:58)
        at org.apache.spark.sql.classic.Dataset.collectFromPlan(Dataset.scala:2244)
        at org.apache.spark.sql.classic.Dataset.$anonfun$head$1(Dataset.scala:1379)
        at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$2(Dataset.scala:2234)
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:654)
        at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$1(Dataset.scala:2232)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$8(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.withSessionTagsApplied(SQLExecution.scala:272)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$7(SQLExecution.scala:125)
        at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
        at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112)
        at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:106)
        at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:295)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:124)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:237)
        at org.apache.spark.sql.classic.Dataset.withAction(Dataset.scala:2232)
        at org.apache.spark.sql.classic.Dataset.head(Dataset.scala:1379)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2810)
        at org.apache.spark.sql.classic.Dataset.getRows(Dataset.scala:339)
        at org.apache.spark.sql.classic.Dataset.showString(Dataset.scala:375)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
        at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:599)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:945)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:925)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:532)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
        at org.apache.spark.scheduler.Task.run(Task.scala:147)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        ... 1 more
Caused by: java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
        at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:756)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:933)
        ... 26 more

I have already tried changing the Python version, updating Java, downloading winutils from its GitHub repository, and setting the JAVA_HOME, HADOOP_HOME, and SPARK_HOME system environment variables.
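For reference, a standard way to pin which interpreter the Spark workers use is the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables (a minimal sketch; whether this matches the attempts above is an assumption):

import os
import sys

# Point Spark's Python workers at the same interpreter that runs the driver.
# Must be set before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable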

Comments:

  • The error shows OSError: [WinError 10038] "An operation was attempted on something that is not a socket" (the message appears mis-encoded in the console); maybe it sends text which is not UTF-8 and this causes the problem. – Commented Sep 28 at 15:42
  • Have you considered setting spark.sql.execution.pyspark.udf.faulthandler.enabled, as suggested in the error message? See the sketch after these comments. – Commented Sep 28 at 15:46
  • Another question reports the same OSError: [WinError 10038], so it may be the same (or a similar) problem: "python socket programming OSError: [WinError 10038] an operation was attempted on something that is not a socket" on Stack Overflow. – Commented Sep 28 at 16:30
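A minimal sketch of the second comment's suggestion (both configuration keys are taken verbatim from the error message above):

from pyspark.sql import SparkSession

# Recreate the session with Python fault handlers enabled, so a crashing
# worker prints a Python-level traceback instead of only
# "Python worker exited unexpectedly".
spark = (
    SparkSession.builder
    .appName("PySpark Test")
    .config("spark.python.worker.faulthandler.enabled", "true")
    .config("spark.sql.execution.pyspark.udf.faulthandler.enabled", "true")
    .getOrCreate()
)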

1 Answer


The Python worker is being killed mid-stream while data is serialized and sent over a socket to it. For some reason the Python worker crashes unexpectedly and the socket is closed; the JVM then reports that error back to the driver.

Can you try increasing the overhead memory?
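For example (a sketch only; spark.driver.memoryOverhead and spark.executor.memoryOverhead are standard Spark settings, the values here are illustrative, and they mainly take effect under a cluster manager such as YARN or Kubernetes rather than local mode):

from pyspark.sql import SparkSession

# Give the JVM processes more off-heap headroom, which is where the
# Python worker processes are accounted for. Tune the values as needed.
spark = (
    SparkSession.builder
    .appName("PySpark Test")
    .config("spark.driver.memoryOverhead", "1g")
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate()
)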
