0

I'm new to Apache Spark and Scala (also a beginner with Hadoop in general). I completed the Spark SQL tutorial: https://spark.apache.org/docs/latest/sql-programming-guide.html I tried to perform a simple query on a standard csv file to benchmark its performance on my current cluster.

I used data from https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz, converted it to csv and copy/pasted the data to make it 10 times as big.

I loaded it into Spark using Scala:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

Define classes:

case class datum(exchange: String,stock_symbol: String,date: String,stock_price_open: Double,stock_price_high: Double,stock_price_low: Double,stock_price_close: Double,stock_volume: String,stock_price_adj_close: Double)

Read in data:

val data = sc.textFile("input.csv").map(_.split(";")).filter(line => "exchange" != "exchange").map(p => datum(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString, p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble, p(6).trim.toDouble, p(7).trim.toString, p(8).trim.toDouble))

Convert to table:

data.registerAsTable("data")

Define query (list all rows with 'IBM' as stock symbol):

val IBMs = sqlContext.sql("SELECT * FROM data WHERE stock_symbol ='IBM'")

Perform count so query actually runs:

IBMs.count()

The query runs fine, but returns res: 0 instead of 5000 (which is what it returns using Hive with MapReduce).

1 Answer 1

5

filter(line => "exchange" != "exchange")

Since "exchange" is equal to "exchange" filter will return a collection of size 0. And since there is no data, querying for any result will return 0. You need to re-write your logic.

Sign up to request clarification or add additional context in comments.

3 Comments

I thought this was the way to remove the first line (containing headers) in Scala. So how do I remove the first line when reading in a csv file?
filter(line => line != "exchange")
Thanks, that makes sense. I did not try it out: I got tired of the headers and just hard deleted them from the csv file.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.