
I want to convert an Array[org.apache.spark.sql.Row] to a DataFrame. Could anyone suggest a better way?

I tried to first convert it into an RDD and then into a DataFrame, but when I perform any operation on the DataFrame, exceptions are thrown.

val arrayOfRows = myDataFrame.collect().map(t => myfun(t))
val distDataRDD = sc.parallelize(arrayOfRows)
val newDataframe = sqlContext.createDataFrame(distDataRDD, myschema)

Here myfun() is a function which returns a Row (org.apache.spark.sql.Row). The contents of the array are correct, and I am able to print them without any problem.

But when I tried to count the records in the RDD, it gave me the count along with a warning that one of the stages contains a task of very large size. I guess I am doing something wrong. Please help.

2 Answers


You have a bug in the first line. collect returns a plain Scala Array, so the subsequent map runs myfun locally on the driver instead of distributed across the cluster; parallelizing that fully materialized array afterwards is what produces the oversized-task warning.

Try val arrayOfRows = myDataFrame.map(t => myfun(t)).collect() instead.
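And if the end goal is a DataFrame rather than a local Array, here is a minimal sketch of the whole flow (assuming Spark 1.x, where DataFrame.map returns an RDD[Row], and that myfun has type Row => Row); since the mapped result is already distributed, collect and parallelize can be skipped entirely:

val rowRDD = myDataFrame.map(t => myfun(t))                      // RDD[Row], computed on the cluster
val newDataframe = sqlContext.createDataFrame(rowRDD, myschema)  // attach the schema directly
newDataframe.count()                                             // no oversized-task warning now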


3 Comments

I am getting this error when I change the order: org.apache.spark.SparkException: Task not serializable
arrayOfRows is actually an RDD/DataFrame in that case, so there's no need for lines 2 and 3 (sc.parallelize expects a local collection, not an RDD or a DataFrame, which is the reason behind the new exception)
I am getting the error as soon as I enter the first line: val arrayOfRows = myDataFrame.collect().map(t => myfun(t))
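As an aside on the Task not serializable error in this thread: it usually means the closure passed to map captures a non-serializable enclosing object, for example when myfun is a method on such a class. A hedged sketch of the common fix, assuming that is the cause (RowTransforms and its placeholder body are hypothetical):

// Defining the transform in a standalone object keeps Spark from trying
// to ship the whole enclosing instance to the executors.
object RowTransforms extends Serializable {
  def myfun(r: org.apache.spark.sql.Row): org.apache.spark.sql.Row = r  // placeholder body
}

val arrayOfRows = myDataFrame.map(RowTransforms.myfun _).collect()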
case class PgRnk(userId: Long, pageRank: Double)
// create a case class for the typed rows

import sqlContext.implicits._
// needed to bring .toDS() into scope

sc.parallelize(pg10.map(r1 => PgRnk(r1.getLong(0), r1.getDouble(1)))).toDS()
// sc.parallelize converts the array into an RDD, and .toDS() then converts that RDD into a Dataset
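A short usage note, assuming pg10 is the Array[Row] from the question with a Long and a Double column: the typed Dataset can be turned back into a DataFrame with .toDF() if downstream code expects one.

val ds = sc.parallelize(pg10.map(r1 => PgRnk(r1.getLong(0), r1.getDouble(1)))).toDS()
ds.toDF().show()  // back to an untyped DataFrame, with columns named userId and pageRank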

