I am using PySpark 2.4 and want to write data to SQL Server, but it isn't working.

I've placed the jar file downloaded from here in the Spark jars path:

D:\spark-2.4.3-bin-hadoop2.7\spark-2.4.3-bin-hadoop2.7\jars\

But to no avail. The following is the PySpark code that writes the data to SQL Server.

sql_server_dtls = {'user': 'john', 'password': 'doe'}

ports_budget_joined_DF.write.jdbc(url="jdbc:sqlserver://endpoint:1433;databaseName=poc", table='dbo.test_tmp', mode='overwrite', properties=sql_server_dtls)

I'm getting the error below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\aakash.basu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\readwriter.py", line 982, in jdbc
    self.mode(mode)._jwrite.jdbc(url, table, jprop)
  File "C:\Users\aakash.basu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\Users\aakash.basu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\aakash.basu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.jdbc.
: java.sql.SQLException: No suitable driver

Am I missing something? Also, I want to truncate the table before writing the new data to it. Does mode='overwrite' in the DataFrame writer handle that for a SQL Server target as well?

1 Answer

You just need to set the JDBC driver class (com.mysql.cj.jdbc.Driver in this MySQL example) in the connection properties; Spark can then download the driver automatically into whichever directory it looks for jars in.

Use this function:

def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    # Build the JDBC URL and connection properties, including the driver class
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }

    # Read the table through Spark's JDBC data source
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df
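
Since the question is about writing to SQL Server rather than reading from MySQL, here is a minimal sketch of the same pattern adapted for the write side. The helper name is hypothetical, the driver class com.microsoft.sqlserver.jdbc.SQLServerDriver comes from Microsoft's JDBC driver, and the truncate option is Spark's JDBC writer option for keeping the existing table when mode='overwrite' (which also touches on the truncation part of the question):

def write_to_sql_server(
    df, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    # Hypothetical helper, adapted from connect_to_sql above for the write path
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
        jdbc_hostname, jdbc_port, database
    )

    connection_details = {
        "user": username,
        "password": password,
        # Driver class shipped with Microsoft's JDBC driver for SQL Server
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    # mode='overwrite' drops and recreates the table by default; with
    # truncate=true Spark issues a TRUNCATE instead and keeps the schema
    df.write.option("truncate", "true").jdbc(
        url=jdbc_url, table=data_table, mode="overwrite", properties=connection_details
    )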

ADDITION:

You can use the function below (edit it to suit your use case) to pass packages when declaring your SparkSession(). You can pass the artifact IDs of the packages as a list or as a comma-separated string. You can get them from the Central Repository.

from pyspark.sql import SparkSession


def create_spark_session(master_url, packages=None):
    """
    Creates a Spark session.
    :param master_url: address of the cluster you want to submit the job to, or local[*] to run locally with all cores
    :param packages: any external packages, if needed; either a comma-separated string of package
        specifications or a list of package specifications
    :return: spark session object
    """
    builder = (
        SparkSession.builder.master(master_url)
        .config("spark.io.compression.codec", "snappy")
        .config("spark.ui.enabled", "false")
    )
    if packages:
        packages = ",".join(packages) if isinstance(packages, list) else packages
        builder = builder.config("spark.jars.packages", packages)

    return builder.getOrCreate()
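
For example, to have Spark fetch the SQL Server driver itself, you could pass its Maven coordinate when creating the session; the mssql-jdbc artifact is Microsoft's driver, and the exact version shown here is only illustrative:

spark = create_spark_session(
    "local[*]",
    packages=["com.microsoft.sqlserver:mssql-jdbc:7.2.2.jre8"],
)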

12 Comments

Sorry for the late update, I just did that. But it is very slow for bigger datasets, probably because a row-wise load is happening. Can the load performance be increased?
Second point: even if I give the driver option, it fails if I don't pass the jar via the --jars parameter when I submit the job or start the PySpark shell. To my understanding, when we provide the driver, it downloads from the internet and doesn't look for the jar in the jars path inside Spark or in the --jars parameter, but that's not happening in the test PySpark shell on my local machine. Any idea why?
@AakashBasu Spark will download and save it to a jar path, and next time it will look for it in the same path (provided the environment doesn't change). If we pass the --jars option in the console, I think it looks straight into the jars directory (without trying to download it). I will update my answer to include something more.
Can you please help me with the first comment? It is very slow; how do I speed up the write to SQL Server?
@AakashBasu Are you filtering the data after importing it into spark?
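
Regarding the slow write discussed in the comments above, a hedged sketch: Spark's JDBC writer accepts a batchsize option (rows per JDBC batch insert, 1000 by default), and repartitioning the DataFrame controls how many connections write in parallel. The values below are illustrative only:

ports_budget_joined_DF.repartition(8).write.option("batchsize", 10000).jdbc(
    url="jdbc:sqlserver://endpoint:1433;databaseName=poc",
    table="dbo.test_tmp",
    mode="overwrite",
    properties=sql_server_dtls,
)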