
I have trouble getting different versions of PySpark to work correctly on my Windows machine in combination with different versions of Python installed via pyenv.

The setup:

  1. I installed pyenv and let it set the environment variables (PYENV, PYENV_HOME, PYENV_ROOT and the entry in PATH)
  2. I installed the Amazon Corretto Java JDK (jdk1.8.0_412) and set the JAVA_HOME environment variable.
  3. I downloaded the winutils.exe & hadoop.dll from here and set the HADOOP_HOME environment variable.
  4. Via pyenv I installed Python 3.10.10 and then pyspark 3.4.1
  5. Via pyenv I installed Python 3.8.10 and then pyspark 3.2.1
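
As a quick sanity check of steps 1 to 3, the variables can be inspected from any Python session (names exactly as set above):

import os
# Print the environment variables set during the installation steps above.
for name in ("PYENV", "PYENV_HOME", "PYENV_ROOT", "JAVA_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))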

Python works as expected:

  • I can switch between different versions with pyenv global <version>
  • When I run python --version in PowerShell, it always shows the version I last set with pyenv.
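
A more direct check than python --version is printing the resolved interpreter path, which should change after pyenv global <version> and a fresh shell:

import sys
# Absolute path of the interpreter currently running; with pyenv-win this
# should point into ...\.pyenv\pyenv-win\versions\<active version>.
print(sys.executable)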

But I'm having trouble with PySpark.

For one, I cannot start PySpark from the PowerShell console by running pyspark: The term 'pyspark' is not recognized as the name of a cmdlet, function, script file [...]

More annoyingly, my repo scripts (with a .venv created via pyenv and Poetry) also fail:

  • Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified [...] Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
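
For context: Spark starts its Python workers with the interpreter named in the PYSPARK_PYTHON environment variable and falls back to python3, which typically does not exist on Windows, hence the error above. A minimal per-script workaround, sketched here as an assumption about what fits this setup, is to point both driver and workers at the interpreter that runs the script, before the SparkSession is created:

import os
import sys
# Use the interpreter executing this script for driver and workers alike,
# instead of the "python3" default that Windows cannot resolve.
# Must run before SparkSession.builder...getOrCreate().
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable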

However, both work after I add the following two entries to the PATH environment variable:

  • C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10
  • C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10\Scripts

but then I would have to "hardcode" the Python version, which is exactly what I want to avoid by using pyenv.

If I hardcode the path, then even after switching to another Python version (pyenv global 3.8.10), running pyspark in PowerShell starts PySpark 3.4.1 from the PATH entry for Python 3.10.10. Likewise, python on the command line always points to the hardcoded version, no matter what I do with pyenv.
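
One way to avoid the hardcoded PATH entries, sketched under the assumption that your pyenv-win provides the vname subcommand (it prints only the active version string), is to resolve the active version at runtime and extend PATH for the current process only:

import os
import subprocess
# Ask pyenv-win for the currently active version, e.g. "3.10.10".
version = subprocess.check_output("pyenv vname", shell=True, text=True).strip()
base = os.path.join(os.environ["PYENV_ROOT"], "versions", version)
# Prepend the interpreter folder and its Scripts folder for this process only.
os.environ["PATH"] = os.pathsep.join(
    [base, os.path.join(base, "Scripts"), os.environ["PATH"]]
)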

I was hoping to be able to start PySpark 3.2.1 from Python 3.8.10, which I had just "activated" globally with pyenv.

What do I have to do to be able to switch between the Python installations (and thus also between PySpark versions) with pyenv, without "hardcoding" the Python paths?

Example PySpark script:

from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession using all available cores.
spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("myapp")
    .getOrCreate()
)

# A tiny DataFrame; with no explicit schema the columns are named _1 and _2.
data = [("Finance", 10),
        ("Marketing", 20),
        ]
df = spark.createDataFrame(data=data)
df.show(10, False)  # print up to 10 rows without truncating values
  • Have you tried using separate virtual envs for PySpark versions? You can use something like pyenv-virtualenv. Commented May 5, 2024 at 14:55
  • Sure, I use virtualenvs, and Poetry even creates separate ones for different repositories. But that's not the issue here: the issue is that Windows cannot call the Python / PySpark versions dynamically selected by pyenv; instead I have to hardcode the environment variables to specific versions. Commented May 6, 2024 at 15:28

1 Answer


I "solved" the issue by removing the Python paths from the PATH environment variable entirely and doing everything exclusively via pyenv. I suppose my original goal is not possible.

I can still start a Python process by running pyenv exec python in the terminal.

But, disappointingly, I can no longer launch a Spark process from the terminal.

At least my repositories work as expected when setting the pyenv versions (pyenv local 3.8.10 / pyenv global 3.10.10).
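
To confirm that a repository really runs on the pyenv-selected interpreter end to end, one can compare the driver's Python with the version a Spark worker reports; a small sketch, assuming the session starts successfully:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Version of the interpreter driving this script.
print("driver:", sys.version.split()[0])
# Run a trivial one-partition job so a worker reports its own version.
worker = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: __import__("sys").version.split()[0])
    .first()
)
print("worker:", worker)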

