Renaming column names of a DataFrame in Spark Scala

Question

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. as of now I come up with following code which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

zero323 · Accepted Answer · 2017-03-26 20:15:56Z

264

If structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which use with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

edited Mar 26, 2017 at 20:15

answered Feb 24, 2016 at 4:00

zero323

331k108 gold badges981 silver badges958 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

Umesh Kacha Over a year ago

Hi @zero323 When using withColumnRenamed I am getting AnalysisException can't resolve 'CC8. 1' given input columns... It fails even though CC8.1 is available in DataFrame please guide.

zero323 Over a year ago

@u449355 It is not clear for me if this is nested column or a one containing dots. In the later case backticks should work (at least in some basic cases).

Anton Kim Over a year ago

what does : _*) mean in df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

Mylo Stone Over a year ago

To answer Anton Kim's question: the : _* is the scala so-called "splat" operator. It basically explodes an array-like thing into an uncontained list, which is useful when you want to pass the array to a function that takes an arbitrary number of args, but doesn't have a version that takes a List[]. If you're at all familiar with Perl, it is the difference between some_function(@my_array) # "splatted" and some_function(\@my_array) # not splatted ... in perl the backslash "\" operator returns a reference to a thing.

Imad Over a year ago

This statement is really obscure to me df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*).. Could you decompose it please? especially the lookup.getOrElse(c,c) part.

|

Tagar · Accepted Answer · 2019-12-11 16:40:43Z

22

For those of you interested in PySpark version (actually it's same in Scala - see comment below) :

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)

edited Dec 11, 2019 at 16:40

answered Mar 26, 2017 at 20:23

Tagar

15k7 gold badges102 silver badges116 bronze badges

1 Comment

Ihor Konovalenko Over a year ago

With using toDF() for renaming columns in DataFrame must be careful. This method works much slower than others. I have DataFrame contains 100M records and simple count query over it take ~3s, whereas the same query with toDF() method take ~16s. But when use select col AS col_new method for renaming I get ~3s again. More than 5 times faster! Spark 2.3.2.3

Mylo Stone · Accepted Answer · 2018-04-12 23:06:56Z

7

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case is isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.

edited Apr 12, 2018 at 23:06

answered May 4, 2017 at 23:06

Mylo Stone

3093 silver badges8 bronze badges

1 Comment

Ged Over a year ago

like it for sure, nice and elegant

Jagadeesh Verri · Accepted Answer · 2019-12-01 18:06:31Z

1

Suppose the dataframe df has 3 columns id1, name1, price1 and you wish to rename them to id2, name2, price2

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I found this approach useful in many cases.

answered Dec 1, 2019 at 18:06

Jagadeesh Verri

312 bronze badges

Comments

R. RamkumarYugo · Accepted Answer · 2020-10-21 08:34:20Z

1

Sometime we have the column name is below format in SQLServer or MySQL table

Ex  : Account Number,customer number

But Hive tables do not support column name containing spaces, so please use below solution to rename your old column names.

Solution:

val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)

edited Oct 21, 2020 at 8:34

answered Oct 20, 2020 at 4:13

R. RamkumarYugo

411 silver badge7 bronze badges

Comments

Colin Wang · Accepted Answer · 2020-08-17 07:09:50Z

0

tow table join not rename the joined key

// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

works!

answered Aug 17, 2020 at 7:09

Colin Wang

8711 gold badge10 silver badges14 bronze badges

Collectives™ on Stack Overflow

Renaming column names of a DataFrame in Spark Scala

6 Answers 6

9 Comments

1 Comment

1 Comment

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

9 Comments

1 Comment

1 Comment

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related