I am trying to calculate the correlation between all columns of a Spark DataFrame using the code below.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession
  .builder
  .appName("SparkCorrelation")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Assemble c1..c3 into a single vector column, as Correlation.corr expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")
val transformed = assembler.transform(df)

// Pearson correlation matrix over the assembled vector column.
val corr = Correlation.corr(transformed, "vectors", "pearson")
corr.show(100, false)
The output comes out as a DataFrame with a single column that holds the whole correlation matrix:
| pearson(vectors) |
|---|
| 1.0 1.0000000000000002 0.9999999999999998 \n1.0000000000000002 1.0 1.0000000000000002 \n0.9999999999999998 1.0000000000000002 1.0 |
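
For what it's worth, that single cell is an `org.apache.spark.ml.linalg.Matrix`; it can be pulled out of the first `Row` by pattern matching (the same pattern the Spark ML correlation example uses), but that still isn't the shape I'm after:

```scala
// The result has one row whose only column is the correlation Matrix.
val Row(coeff: Matrix) = corr.head
println(s"Pearson correlation matrix:\n$coeff")
```
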
What I want instead is output in the following format. Can somebody please help?
| Column | c1 | c2 | c3 |
|---|---|---|---|
| c1 | 1 | 0.97 | 0.92 |
| c2 | 0.97 | 1 | 0.94 |
| c3 | 0.92 | 0.94 | 1 |
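
For illustration, this is roughly the kind of reshaping I have in mind (a sketch that assumes exactly the three columns c1, c2, c3 and reuses the `coeff` matrix extracted above); I don't know whether it is the idiomatic way to do it:

```scala
// Rough sketch: pair each row of the correlation Matrix with its column name
// and flatten it back into a plain DataFrame (assumes exactly c1, c2, c3).
val cols = Seq("c1", "c2", "c3")
val labelledRows = coeff.rowIter.toSeq.zip(cols).map { case (vec, name) =>
  (name, vec(0), vec(1), vec(2))
}
labelledRows.toDF(("Column" +: cols): _*).show()
```

(The 0.9x numbers in the table above are just placeholders for the layout; with the two sample rows every coefficient comes out as essentially 1.0.)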