
Hi, I am facing an error when providing dependency jars to spark-submit on Kubernetes.

/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
  --master k8s://https://112.23.123.23:6443 \
  --deploy-mode cluster \
  --name spark-postgres-minio-kubernetes \
  --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 \
  file:///AirflowData/kubernetes/python/postgresminioKube.py
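For comparison, the Dependency Management section of the Spark on Kubernetes docs (referenced in the comments below) distinguishes file:// URIs, which are uploaded from the submitting machine to spark.kubernetes.file.upload.path, from local:// URIs, which point at files already baked into the container image. A minimal sketch of the latter, assuming the driver jar were copied into the image at /opt/spark/jars (a hypothetical path):

--jars local:///opt/spark/jars/postgresql-42.2.14.jar \
--driver-class-path /opt/spark/jars/postgresql-42.2.14.jar \

With local:// no upload happens and the executors never need to fetch the jar over the network.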

Below is the code to execute. The jars needed for S3/MinIO, along with the related configurations, are placed in $SPARK_HOME/conf and $SPARK_HOME/jars, and the Docker image has been built with them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()

# JDBC connection details (placeholder values)
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hostname", "port", "db")
connectionProperties = {
  "user": "username",
  "password": "password",
  "driver": "org.postgresql.Driver",
  "fetchsize": "100000"
}

# Spark wraps this subquery as the JDBC "table", so the query is pushed down
pushdown_query = "(select * from public.employees) emp_als"

# Partitioned read: split on employee_id into 2 partitions
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id",
                     lowerBound=1, upperBound=100, numPartitions=2,
                     properties=connectionProperties)

# Write as CSV and as Parquet to separate output paths, so one overwrite does
# not clobber the other (delimiter/header options only apply to CSV)
df.write.format('csv').option('delimiter', ',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-parquet/')
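A small diagnostic (my sketch, not from the original post) that can help in this situation: print which jars the running session has registered, to confirm whether spark-submit actually distributed the PostgreSQL driver.

# Diagnostic sketch: "spark.jars" is a standard Spark config key listing the
# jars passed to the session; the default is returned if none were registered.
print(spark.conf.get("spark.jars", "<no jars registered>"))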

The error is below. For some reason it is trying to execute the jar:

21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573

21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
  • Is there a stack trace belonging to the ERROR line? Commented Nov 16, 2021 at 19:53
  • Is the path /AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar on the machine from which you submit or inside that hostname:5000/spark-py:spark3.1.2 container image or somewhere else? Commented Nov 16, 2021 at 20:07
  • ... either way, please have a close look at the Dependency Management section of the Spark on Kubernetes documentation. Commented Nov 16, 2021 at 20:09
  • The path is mounted on all nodes, but it works only if the jars are built into the image. Commented Nov 22, 2021 at 13:12
  • Were you still using the command you posted originally? Changing it according to the doc section I referenced might help. Commented Dec 14, 2021 at 22:16

1 Answer


The external jars are added to /opt/spark/work-dir, and the Spark user did not have access to that directory. So I changed the Dockerfile to grant access to the folder, and then it worked.

RUN chmod 777 /opt/spark/work-dir
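For context, a minimal Dockerfile sketch showing where that line fits; <your-spark-base-image> is a placeholder for whatever base the original image was built from:

# <your-spark-base-image> is a placeholder, not a real image name
FROM <your-spark-base-image>
# Executor pods fetch --jars into /opt/spark/work-dir at startup; without write
# access there the executor fails with "Unable to create executor due to ...jar".
# chmod 777 is the blunt fix from this answer; chown to the spark user also works.
RUN chmod 777 /opt/spark/work-dir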