Unable to create dataframe using SQLContext object in spark2.2

Question

I am using Spark 2.2 on Microsoft Windows 7. I want to load a CSV file into a variable so that I can run SQL queries on it later, but I am unable to do so. I referred to the accepted answer from this link, but it did not help. I followed the steps below to create the SparkContext and SQLContext objects:

import org.apache.spark.SparkContext  
import org.apache.spark.SparkConf  
val sc=SparkContext.getOrCreate() // Creating spark context object 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks

The objects are created successfully, but when I execute the code below it throws an error that can't be posted here.

val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")

When I then try something like df.show(2), it says that df was not found. I also tried the Databricks solution for loading CSV from the linked answer. It downloads the packages but doesn't load the CSV file. How can I fix this? Thanks in advance :)

I am able to create the objects, sir. I tried playing with the sc object and it works perfectly.
– whatsinthename, Commented Dec 25, 2017 at 6:48
As @undefined_variable suggested, you can use SparkSession to do this. If you are running spark-shell, you get a SparkSession in the spark variable.
– deadbug, Commented Dec 25, 2017 at 6:59
I already tried this: import org.apache.spark.sql.SparkSession; val spark = SparkSession.builder.master("local").appName("spark session example").getOrCreate() @VipinGS
– whatsinthename, Commented Dec 25, 2017 at 7:33

whatsinthename · Accepted Answer · 2017-12-29 17:05:55Z

3

I solved the problem of loading a local file into a DataFrame by using Spark 1.6 in the Cloudera VM, with the code below:

1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar  

2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")

NOTE: The sc and sqlContext variables are created automatically. There are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question about that.


deadbug · Accepted Answer · 2017-12-25 16:50:30Z

2

In reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your CSV file using Spark SQL.

Spark SQL is a Spark module for structured data processing.

There are mainly two abstractions, Dataset and DataFrame:

A Dataset is a distributed collection of data.

A DataFrame is a Dataset organized into named columns. In the Scala API, DataFrame is simply a type alias of Dataset[Row].

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

You have a CSV file, and you can create a DataFrame from it by doing one of the following:

From your spark-shell using the SparkSession variable spark:

val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("sample.csv")

After reading the file into a DataFrame, you can register it as a temporary view.

df.createOrReplaceTempView("foo")

SQL statements can be run using the sql method provided by Spark:

val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")

You can also query that file directly with SQL:

val df = spark.sql("SELECT * FROM csv.`file:///path to the file/`")

Make sure that you run Spark in local mode when you load data from the local file system, or else you will get an error. The error occurs when the HADOOP_CONF_DIR environment variable is already set, in which case Spark expects "hdfs://..." paths; otherwise it expects "file://".
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse).

.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")

This is the default location of the Hive warehouse directory (using Derby), which holds managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the CSV.
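
Putting the steps above together, a minimal sketch of a complete session might look like the following. It assumes Spark 2.2 is on the classpath and that the code is pasted into spark-shell with :paste (or wrapped in an object for a standalone application). The app name, the temporary warehouse directory, and the two-row sample file are placeholders made up for illustration, not values from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical sample data so the sketch is self-contained.
val csvPath = Files.createTempFile("resource-data", ".csv")
Files.write(csvPath, "name,age\nalice,15\nbob,30\n".getBytes("UTF-8"))

// Placeholder warehouse location; any writable directory should do.
val warehouseDir = Files.createTempDirectory("spark-warehouse").toUri.toString

// In spark-shell a session already exists, so getOrCreate() returns that one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-example")
  .config("spark.sql.warehouse.dir", warehouseDir)
  .getOrCreate()

// Pass an explicit file: URI so the path is read from the local file system.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true") // read age as a number rather than a string
  .load(csvPath.toUri.toString)

// Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("foo")
val teens = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
teens.show()
```

Building the path with toUri keeps the file: scheme explicit, which should avoid the path being resolved against HDFS when HADOOP_CONF_DIR is set.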

Reference : Spark SQL Programming Guide

edited Dec 25, 2017 at 16:50

answered Dec 25, 2017 at 10:40


11 Comments

whatsinthename Over a year ago

I tried this line: val df = spark.read.format("csv").option("header", "true").load("D://ResourceData.csv"). It throws a long error, and when I execute df.show() it says df was not found.

whatsinthename Over a year ago

I want to load data from my local machine, not from HDFS. Can you please check the working path at your end?

deadbug Over a year ago

Updated with the required path info.

whatsinthename Over a year ago

I am trying it. Also, is there a better tool for working with Spark than running it in cmd?

deadbug Over a year ago

It is one Google search away: here. Please keep comments specific to the question.


ReKx · Accepted Answer · 2017-12-27 02:35:28Z

0

Spark version 2.2.0 has built-in support for CSV.

In your spark-shell, run the following code:

val df = spark.read
  .option("header", "true")
  .csv("D:/abc.csv")

df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]

edited Dec 27, 2017 at 2:35

answered Dec 27, 2017 at 2:28


2 Comments

whatsinthename Over a year ago

Okay, I'll try this and let you know.

whatsinthename Over a year ago

I got this error:
17/12/27 10:36:38 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062)
