
I am new to PySpark.

I have installed Java 17 and made sure it works:

C:\Windows\System32>java -version

java version "17.0.12" 2024-07-16 LTS

Installed Python 3.9 and made sure it works:

C:\Windows\System32>python --version

Python 3.9.13

Copied winutils.exe into the folder C:\winutils\bin and set:

set HADOOP_HOME=C:\winutils

Then I ran:

C:\Windows\System32>pip install pyspark

C:\Windows\System32>pip install "pyspark[sql]"

C:\Windows\System32>pip install findspark

Then I ran:

C:\Windows\System32>pyspark

and got a Spark session going:

Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/15 14:01:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022 16:36:42)
Spark context Web UI available at http://LAPTOP-FE5VVC1N:4040
Spark context available as 'sc' (master = local[*], app id = local-1763244089545).
SparkSession available as 'spark'.
>>>

At the prompt I ran the following:

>>> import findspark

>>> findspark.init()

>>> data = [("Alice", 25), ("Bob", 30), ("Cathy", 29)]

>>> columns = ["name", "age"]

>>> df = spark.createDataFrame(data, columns)

>>>

Everything is fine up to this point.

Now if I try to run either df.show() or df.count(),
I get a py4j.protocol.Py4JJavaError.

My environment variables are as follows:

HADOOP_HOME=C:\winutils

JAVA_HOME=C:\Program Files\Java\jdk-17

PYSPARK_DRIVER_PYTHON=python

PYSPARK_PYTHON=pythonC:\Program Files\Python39\Scripts\

My Path variable has the following entries:

C:\Program Files\Python39\Scripts\

C:\Program Files\Python39\

C:\winutils\bin

C:\Program Files\Java\jdk-17\bin
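For reference, here is a small sanity check I can run from plain Python to see how these variables resolve on my machine (a minimal sketch using only the standard library; the variable names are the ones listed above, and shutil.which returns None when a value is not a runnable command):

```python
import os
import shutil

# Print the Spark-related environment variables as the interpreter sees them.
# Note: PYSPARK_PYTHON should name an executable (e.g. "python" or a full
# path to python.exe), not a directory.
for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "JAVA_HOME", "HADOOP_HOME"):
    print(f"{var} = {os.environ.get(var)!r}")

# shutil.which resolves a command name against PATH; None means "not runnable".
print("python resolves to:", shutil.which("python"))
print("java resolves to:", shutil.which("java"))
```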

Any help will be appreciated.


1 Answer


Problems like this often come from version incompatibilities between Spark, Python, Java, or Hadoop. In my case, the following combination works without issues: Python 3.11, Java 17, Hadoop 3.3.6, and PySpark 3.5.1.
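To see which versions are actually in play on a given machine, something like this can be run first (a minimal sketch using only the standard library; the `java -version` call assumes java is on PATH, and by convention that command writes to stderr):

```python
import subprocess
import sys

# Report the versions that matter for Spark / Python / Java compatibility.
print("Python:", sys.version.split()[0])

try:
    import pyspark
    print("PySpark:", pyspark.__version__)
except ImportError:
    print("PySpark: not installed")

try:
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # java -version prints its banner to stderr, not stdout.
    print("Java:", result.stderr.splitlines()[0] if result.stderr else "unknown")
except FileNotFoundError:
    print("Java: not on PATH")
```

Comparing this output against a known-good combination (such as the one above) is usually the fastest way to spot the mismatch.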
