
I am trying to query a SQL database over a JDBC connection in Databricks and store the query results as a pandas DataFrame. All of the methods I can find for this online involve first storing the result as some kind of Spark object using Scala code and then converting that to pandas. For cell 1 I tried:

%scala
val df_table1 = sqlContext.read.format("jdbc").options(Map(
    ("url" -> "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"),
    ("dbtable" -> "(select top 10 * from myschema.table) as table"),
    ("user" -> "user"),
    ("password" -> "password123"),
    ("driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))
).load()

which results in:

df_table1: org.apache.spark.sql.DataFrame = [var1: int, var2: string ... 50 more fields]

Great! But when I try to convert it to a pandas df in cell 2 so I can use it:

import numpy as np
import pandas as pd 

result_pdf = df_table1.select("*").toPandas()

print(result_pdf)

It generates the error message:

NameError: name 'df_table1' is not defined

How do I successfully convert this object to a pandas DataFrame? Alternatively, is there any way of querying the SQL database over the JDBC connection using Python code alone, without needing Scala at all? (I do not particularly like Scala syntax and would rather avoid it if possible.)
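For context on the error: each %scala and %python cell runs in its own interpreter, so a Scala val is never visible to Python. One common Databricks bridge, sketched here assuming the Scala cell above has already run, is to register the DataFrame as a temp view, which both languages can reach through the shared Spark session:

%scala
// Cell 1 addition: expose the Scala DataFrame to other language cells
df_table1.createOrReplaceTempView("df_table1")

%python
# Cell 2: read the view back through the shared Spark session, then convert
result_pdf = spark.table("df_table1").toPandas()
print(result_pdf)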

  • Have you tried loading the data directly to pandas? It has a method exactly for this - pd.read_sql(query, sql_conn) Commented May 29, 2020 at 10:50
  • I can't work out how to set up a jdbc connection using python so far that would allow me to create the connection object. I've looked at the "jaydebeapi" package but I can't work out how to use it from the documentation; it appears to require additional arguments beyond the jdbc url of the database and the credentials. I can't use pyodbc either because I've never been able to get any odbc drivers to work properly on Databricks. Commented May 29, 2020 at 12:24
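For what it's worth, jaydebeapi's extra arguments beyond the URL and credentials are just the driver class name and the path to the driver jar. A minimal sketch, assuming the SQL Server JDBC jar is available on the cluster (the jar path below is a placeholder):

import jaydebeapi
import pandas as pd

# Arguments: driver class, JDBC URL, [user, password], path to the driver jar
# (the jar path is a placeholder; point it at the actual mssql-jdbc jar)
conn = jaydebeapi.connect(
    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    ["user", "password123"],
    "/path/to/mssql-jdbc.jar",
)

# pd.read_sql executes the query over the connection and returns a DataFrame
df = pd.read_sql("select top 10 * from myschema.table", conn)
conn.close()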

1 Answer


I am assuming that your intention is to query SQL Server using Python; if that's the case, the code below will work.

%python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# On Databricks a Spark context already exists, so getOrCreate simply
# returns the running one; this setup only matters outside Databricks.
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

# Connection details (placeholders)
database = "YourDBName"
table = "[dbo].[YourTabelName]"
user = "SqlUser"
password = "SqlPassword"

# Load the first table over JDBC into a Spark DataFrame
DF1 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF1.show()

# Load a second table the same way
table = "[dbo].[someOthertable]"

DF2 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF2.show()

# Join the two Spark DataFrames on their key columns and keep a few fields
Finaldf = DF1.join(DF2, DF1.Prop_0 == DF2.prop_0, how="inner").select(DF1.Prop_0, DF1.Prop_1, DF2.Address)
Finaldf.show()
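Since everything here stays in a single Python session, the pandas conversion the question asks for is then one call; note that toPandas() collects the entire result to the driver, so it is only safe for results that fit in memory:

result_pdf = Finaldf.toPandas()  # pulls all rows to the driver
print(result_pdf)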

3 Comments

This looks promising for importing a whole table, but how would I modify it if I wanted to run a SQL query that joins several tables from the same database and includes where clauses, case statements, etc., and import only the results of that query into the Python dataframe? E.g. in R I would do something like: df <- SparkR::read.jdbc(jdbc_url, "(SELECT TOP 10 * FROM [dbo].[mytable]) as result") %>% SparkR::collect(), where the jdbc_url specifies the database and credentials but not the table. Is there something similarly straightforward in Python?
My apologies, I have not worked on R yet :). But I have updated the code with a join condition. HTH
Is there no way to put the join/case/when/and/or statements inside a SQL query and import only the results, rather than importing two or more entire tables and then joining them in Python? A lot of these tables are very large and I only need very small subsets of them.
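For completeness, the pushdown the R snippet performs is available in PySpark too: pass a parenthesized subquery as dbtable, or on Spark 2.4+ use the query option, so only the query result crosses the wire. A sketch reusing the connection variables from the answer above (the table and column names are the answer's illustrative ones):

%python
# Push the join/filter down to SQL Server; only the small result set is
# transferred. "query" requires Spark 2.4+; on older versions pass
# "(select ...) as result" as the dbtable option instead.
pushed = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("query", """
        select top 10 t1.Prop_0, t1.Prop_1, t2.Address
        from [dbo].[YourTabelName] t1
        join [dbo].[someOthertable] t2 on t1.Prop_0 = t2.prop_0
    """) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

result_pdf = pushed.toPandas()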
