107

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. as of now I come up with following code which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

6 Answers 6

264

If structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which use with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
Sign up to request clarification or add additional context in comments.

9 Comments

Hi @zero323 When using withColumnRenamed I am getting AnalysisException can't resolve 'CC8. 1' given input columns... It fails even though CC8.1 is available in DataFrame please guide.
@u449355 It is not clear for me if this is nested column or a one containing dots. In the later case backticks should work (at least in some basic cases).
what does : _*) mean in df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
To answer Anton Kim's question: the : _* is the scala so-called "splat" operator. It basically explodes an array-like thing into an uncontained list, which is useful when you want to pass the array to a function that takes an arbitrary number of args, but doesn't have a version that takes a List[]. If you're at all familiar with Perl, it is the difference between some_function(@my_array) # "splatted" and some_function(\@my_array) # not splatted ... in perl the backslash "\" operator returns a reference to a thing.
This statement is really obscure to me df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*).. Could you decompose it please? especially the lookup.getOrElse(c,c) part.
|
22

For those of you interested in PySpark version (actually it's same in Scala - see comment below) :

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)

1 Comment

With using toDF() for renaming columns in DataFrame must be careful. This method works much slower than others. I have DataFrame contains 100M records and simple count query over it take ~3s, whereas the same query with toDF() method take ~16s. But when use select col AS col_new method for renaming I get ~3s again. More than 5 times faster! Spark 2.3.2.3
7
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case is isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.

1 Comment

like it for sure, nice and elegant
1

Suppose the dataframe df has 3 columns id1, name1, price1 and you wish to rename them to id2, name2, price2

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I found this approach useful in many cases.

Comments

1
Sometime we have the column name is below format in SQLServer or MySQL table

Ex  : Account Number,customer number

But Hive tables do not support column name containing spaces, so please use below solution to rename your old column names.

Solution:

val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)

Comments

0

tow table join not rename the joined key

// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

works!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.