Spark sql Dataframe - import sqlContext.implicits._

Question

I have main that creates spark context:

    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

Then creates dataframe and does filters and validations on the dataframe.

    val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

    val df = sqlContext.read.schema(struct).format("com.databricks.spark.csv").load(args(0))
    // record length cannot be < 2 
    .na.drop(3)
    // round to hours
    .withColumn("time",convertToHourly($"time"))

This works great.

BUT When I try moving my validations to another file by sending the dataframe to

function ValidateAndTransform(df: DataFrame) : DataFrame = {...}

that gets the Dataframe & does the validations and transformations: It seems like I need the

 import sqlContext.implicits._

To avoid the error: “value $ is not a member of StringContext” that happens on line: .withColumn("time",convertToHourly($"time"))

But to use the import sqlContext.implicits._ I also need the sqlContext either defined in the new file like so:

val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

or send it to the

function ValidateAndTransform(df: DataFrame) : DataFrame = {...}
function

I feel like the separation I'm trying to do to 2 files (main & validation) is not done correctly...

Any idea on how to design this? Or simply send the sqlContext to the function?

Thanks!

When I want to separate things like that I just pass SQLContext in the constructor of the new class and then I import sqlContext.implicits._ once per each class. I couldn't come up with anything better so I vote this question up and wait for better sugestions. — TheMP
– TheMP, Commented Sep 8, 2015 at 11:12

Marco · Accepted Answer · 2015-09-08 15:30:05Z

14

You can work with a singleton instance of the SQLContext. You can take a look at this example in the spark repository

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {

  @transient  private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
...
//And wherever you want you can do
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._

answered Sep 8, 2015 at 15:30

Marco

4,9672 gold badges24 silver badges27 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Etti Gur Over a year ago

Thanks! I did use the singleton object but in my case I want it created only once so did: object SQLContextSingleton { @transient var instance: SQLContext = _ } then initialized it from main, and used it on validations. Thanks for the help!

Collectives™ on Stack Overflow

Spark sql Dataframe - import sqlContext.implicits._

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related