
I am new to PySpark.

I have installed Java 17 and made sure it works:

C:\Windows\System32>java -version

java version "17.0.12" 2024-07-16 LTS

Installed Python 3.9 and made sure it works:

C:\Windows\System32>python --version

Python 3.9.13

Copied winutils.exe into the folder C:\winutils\bin and set:

set HADOOP_HOME=C:\winutils

Then I ran:

C:\Windows\System32>pip install pyspark

C:\Windows\System32>pip install "pyspark[sql]"

C:\Windows\System32>pip install findspark

Then I ran:

C:\Windows\System32>pyspark

and got a Spark session going:

Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/15 14:01:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.1
      /_/

Using Python version 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022 16:36:42)
Spark context Web UI available at http://LAPTOP-FE5VVC1N:4040
Spark context available as 'sc' (master = local[*], app id = local-1763244089545).
SparkSession available as 'spark'.
>>>

At the prompt I ran the following:

>>> import findspark

>>> findspark.init()

>>> data = [("Alice", 25), ("Bob", 30), ("Cathy", 29)]

>>> columns = ["name", "age"]

>>> df = spark.createDataFrame(data, columns)

>>>

Everything is fine up to this point.

Now if I try to run either df.show() or df.count(),
I get a py4j.protocol.Py4JJavaError.

My environment variables are as follows:

HADOOP_HOME=C:\winutils

JAVA_HOME=C:\Program Files\Java\jdk-17

PYSPARK_DRIVER_PYTHON=python

PYSPARK_PYTHON=pythonC:\Program Files\Python39\Scripts\

My Path variable has the following entries:

C:\Program Files\Python39\Scripts\

C:\Program Files\Python39\

C:\winutils\bin

C:\Program Files\Java\jdk-17\bin
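For reference, here is a small sanity check I can run from plain Python to see how these variables resolve on my machine (a minimal sketch using only the standard library; the variable names are the ones listed above, and shutil.which returns None when a value is not a runnable command):

```python
import os
import shutil

# Print the Spark-related environment variables as the interpreter sees them.
# Note: PYSPARK_PYTHON should name an executable (e.g. "python" or a full
# path to python.exe), not a directory.
for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "JAVA_HOME", "HADOOP_HOME"):
    print(f"{var} = {os.environ.get(var)!r}")

# shutil.which resolves a command name against PATH; None means "not runnable".
print("python resolves to:", shutil.which("python"))
print("java resolves to:", shutil.which("java"))
```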

Any help will be appreciated.


1 Answer


Problems like this often come from version incompatibilities between Spark, Python, Java, or Hadoop. In my case, the following combination works without issues: Python 3.11, Java 17, Hadoop 3.3.6, and PySpark 3.5.1.
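To see which versions are actually in play on a given machine, something like this can be run first (a minimal sketch using only the standard library; the `java -version` call assumes java is on PATH, and by convention that command writes to stderr):

```python
import subprocess
import sys

# Report the versions that matter for Spark / Python / Java compatibility.
print("Python:", sys.version.split()[0])

try:
    import pyspark
    print("PySpark:", pyspark.__version__)
except ImportError:
    print("PySpark: not installed")

try:
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # java -version prints its banner to stderr, not stdout.
    print("Java:", result.stderr.splitlines()[0] if result.stderr else "unknown")
except FileNotFoundError:
    print("Java: not on PATH")
```

Comparing this output against a known-good combination (such as the one above) is usually the fastest way to spot the mismatch.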
