14

I have main that creates spark context:

    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

Then creates dataframe and does filters and validations on the dataframe.

    val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

    val df = sqlContext.read.schema(struct).format("com.databricks.spark.csv").load(args(0))
    // record length cannot be < 2 
    .na.drop(3)
    // round to hours
    .withColumn("time",convertToHourly($"time"))

This works great.

BUT When I try moving my validations to another file by sending the dataframe to

function ValidateAndTransform(df: DataFrame) : DataFrame = {...}

that gets the Dataframe & does the validations and transformations: It seems like I need the

 import sqlContext.implicits._

To avoid the error: “value $ is not a member of StringContext” that happens on line: .withColumn("time",convertToHourly($"time"))

But to use the import sqlContext.implicits._ I also need the sqlContext either defined in the new file like so:

val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

or send it to the

function ValidateAndTransform(df: DataFrame) : DataFrame = {...}
function

I feel like the separation I'm trying to do to 2 files (main & validation) is not done correctly...

Any idea on how to design this? Or simply send the sqlContext to the function?

Thanks!

1
  • 1
    When I want to separate things like that I just pass SQLContext in the constructor of the new class and then I import sqlContext.implicits._ once per each class. I couldn't come up with anything better so I vote this question up and wait for better sugestions. Commented Sep 8, 2015 at 11:12

1 Answer 1

14

You can work with a singleton instance of the SQLContext. You can take a look at this example in the spark repository

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {

  @transient  private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
...
//And wherever you want you can do
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._  
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! I did use the singleton object but in my case I want it created only once so did: object SQLContextSingleton { @transient var instance: SQLContext = _ } then initialized it from main, and used it on validations. Thanks for the help!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.