
I am trying to query a SQL database over a JDBC connection in Databricks and store the query results as a pandas DataFrame. All of the methods I can find for this online involve first storing the result as some kind of Spark object using Scala code and then converting that to pandas. For cell 1 I tried:

%scala
val df_table1 = sqlContext.read.format("jdbc").options(Map(
    ("url" -> "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"),
    ("dbtable" -> "(select top 10 * from myschema.table) as table"),
    ("user" -> "user"),
    ("password" -> "password123"),
    ("driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))
).load()

which results in:

df_table1: org.apache.spark.sql.DataFrame = [var1: int, var2: string ... 50 more fields]

Great! But when I try to convert it to a pandas df in cell 2 so I can use it:

import numpy as np
import pandas as pd 

result_pdf = df_table1.select("*").toPandas()

print(result_pdf)

It generates the error message:

NameError: name 'df_table1' is not defined

How do I successfully convert this object to a pandas DataFrame? Alternatively, is there any way of querying the SQL database over the JDBC connection using Python code alone, without needing Scala at all? (I do not particularly like Scala syntax and would rather avoid it if possible.)
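For context on the error: each %scala and %python cell runs in its own interpreter, so a Scala val is never visible to Python. One common Databricks bridge, sketched here assuming the Scala cell above has already run, is to register the DataFrame as a temp view, which both languages can reach through the shared Spark session:

%scala
// Cell 1 addition: expose the Scala DataFrame to other language cells
df_table1.createOrReplaceTempView("df_table1")

%python
# Cell 2: read the view back through the shared Spark session, then convert
result_pdf = spark.table("df_table1").toPandas()
print(result_pdf)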

  • Have you tried loading the data directly to pandas? It has a method exactly for this - pd.read_sql(query, sql_conn) Commented May 29, 2020 at 10:50
  • I can't work out how to set up a jdbc connection using python so far that would allow me to create the connection object. I've looked at the "jaydebeapi" package but I can't work out how to use it from the documentation; it appears to require additional arguments beyond the jdbc url of the database and the credentials. I can't use pyodbc either because I've never been able to get any odbc drivers to work properly on Databricks. Commented May 29, 2020 at 12:24
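For what it's worth, jaydebeapi's extra arguments beyond the URL and credentials are just the driver class name and the path to the driver jar. A minimal sketch, assuming the SQL Server JDBC jar is available on the cluster (the jar path below is a placeholder):

import jaydebeapi
import pandas as pd

# Arguments: driver class, JDBC URL, [user, password], path to the driver jar
# (the jar path is a placeholder; point it at the actual mssql-jdbc jar)
conn = jaydebeapi.connect(
    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    ["user", "password123"],
    "/path/to/mssql-jdbc.jar",
)

# pd.read_sql executes the query over the connection and returns a DataFrame
df = pd.read_sql("select top 10 * from myschema.table", conn)
conn.close()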

1 Answer


I am assuming that your intention is to query SQL Server using Python; if that's the case, the code below will work.

%python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# On Databricks a Spark context already exists, so getOrCreate simply
# returns the running one; this setup only matters outside Databricks.
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

# Connection details (placeholders)
database = "YourDBName"
table = "[dbo].[YourTabelName]"
user = "SqlUser"
password = "SqlPassword"

# Load the first table over JDBC into a Spark DataFrame
DF1 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF1.show()

# Load a second table the same way
table = "[dbo].[someOthertable]"

DF2 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF2.show()

# Join the two Spark DataFrames on their key columns and keep a few fields
Finaldf = DF1.join(DF2, DF1.Prop_0 == DF2.prop_0, how="inner").select(DF1.Prop_0, DF1.Prop_1, DF2.Address)
Finaldf.show()
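Since everything here stays in a single Python session, the pandas conversion the question asks for is then one call; note that toPandas() collects the entire result to the driver, so it is only safe for results that fit in memory:

result_pdf = Finaldf.toPandas()  # pulls all rows to the driver
print(result_pdf)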

3 Comments

This looks promising for importing a whole table, but how would I modify it if I wanted to run a SQL query that joins several tables from the same database and includes where clauses, case statements, etc., and import only the results of that query into the Python dataframe? E.g. in R I would do something like: df <- SparkR::read.jdbc(jdbc_url, "(SELECT TOP 10 * FROM [dbo].[mytable]) as result") %>% SparkR::collect(), where the jdbc_url specifies the database and credentials but not the table. Is there something similarly straightforward in Python?
My apologies, I have not worked on R yet :). But I have updated the code with a join condition. HTH
Is there no way to put the join/case/when/and/or statements inside a SQL query and import only the results, rather than importing two or more entire tables and then joining them in Python? A lot of these tables are very large and I only need very small subsets of them.
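For completeness, the pushdown the R snippet performs is available in PySpark too: pass a parenthesized subquery as dbtable, or on Spark 2.4+ use the query option, so only the query result crosses the wire. A sketch reusing the connection variables from the answer above (the table and column names are the answer's illustrative ones):

%python
# Push the join/filter down to SQL Server; only the small result set is
# transferred. "query" requires Spark 2.4+; on older versions pass
# "(select ...) as result" as the dbtable option instead.
pushed = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("query", """
        select top 10 t1.Prop_0, t1.Prop_1, t2.Address
        from [dbo].[YourTabelName] t1
        join [dbo].[someOthertable] t2 on t1.Prop_0 = t2.prop_0
    """) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

result_pdf = pushed.toPandas()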
