How to change dataframe column names in PySpark?

Question

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with a simple command:

df.columns = new_column_name_list

However, the same doesn't work in PySpark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This basically defines the variable twice: it infers the schema first, renames the columns in that schema, and then loads the dataframe again with the updated schema.

Is there a better and more efficient way to do this like we do in pandas?

My Spark version is 1.5.0

Cristian Ispan · Accepted Answer · 2021-06-08 19:49:31Z

529

There are many ways to do that:

Option 1. Using selectExpr.

 data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                   ["Name", "askdaosdka"])
 data.show()
 data.printSchema()

 # Output
 #+-------+----------+
 #|   Name|askdaosdka|
 #+-------+----------+
 #|Alberto|         2|
 #| Dakota|         2|
 #+-------+----------+

 #root
 # |-- Name: string (nullable = true)
 # |-- askdaosdka: long (nullable = true)

 df = data.selectExpr("Name as name", "askdaosdka as age")
 df.show()
 df.printSchema()

 # Output
 #+-------+---+
 #|   name|age|
 #+-------+---+
 #|Alberto|  2|
 #| Dakota|  2|
 #+-------+---+

 #root
 # |-- name: string (nullable = true)
 # |-- age: long (nullable = true)
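When the old and new names already live in lists, the selectExpr strings can be generated instead of typed by hand. The following is only a sketch: the helper name alias_exprs is hypothetical, and it assumes that no column name contains a backtick. The backticks it emits are there so that names with spaces or punctuation still parse.

```python
def alias_exprs(old_names, new_names):
    """Build 'old AS new' expression strings for DataFrame.selectExpr."""
    if len(old_names) != len(new_names):
        raise ValueError("old_names and new_names must have the same length")
    for name in list(old_names) + list(new_names):
        if "`" in name:
            # Assumption: names containing backticks are out of scope for this sketch.
            raise ValueError("backticks in column names are not handled: %r" % name)
    # Backtick-quote both sides so names with spaces or punctuation still parse.
    return ["`{}` AS `{}`".format(old, new) for old, new in zip(old_names, new_names)]


exprs = alias_exprs(["Name", "askdaosdka"], ["name", "age"])
print(exprs)
```

With the DataFrame from Option 1, the result would be used as df = data.selectExpr(*exprs).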

Option 2. Using withColumnRenamed. Notice that this method allows you to "overwrite" the same column. The code below is written for Python 3; in Python 2, xrange can be used in place of range.

 from functools import reduce

 oldColumns = data.schema.names
 newColumns = ["name", "age"]

 df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
 df.printSchema()
 df.show()

Option 3. Using alias. In Scala you can also use as.

 from pyspark.sql.functions import col

 data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
 data.show()

 # Output
 #+-------+---+
 #|   name|age|
 #+-------+---+
 #|Alberto|  2|
 #| Dakota|  2|
 #+-------+---+

Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

 sqlContext.registerDataFrameAsTable(data, "myTable")
 df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")

 df2.show()

 # Output
 #+-------+---+
 #|   name|age|
 #+-------+---+
 #|Alberto|  2|
 #| Dakota|  2|
 #+-------+---+

edited Jun 8, 2021 at 19:49

Cristian Ispan


answered Dec 3, 2015 at 22:54

Alberto Bonsanto


18 Comments

Felipe Gerard Over a year ago

I did it with a for loop + withColumnRenamed, but your reduce option is very nice :)

Felipe Gerard Over a year ago

Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same!

Alberto Bonsanto Over a year ago

@FelipeGerard Please check this post, bad things may happen if you have many columns.

joaofbsm Over a year ago

@NuValue, you should first run from functools import reduce

rjurney Over a year ago

In PySpark 2.4 with Python 3.6.8 the only method that works out of these is df.select('id').withColumnRenamed('id', 'new_id') and spark.sql("SELECT id AS new_id FROM df")


Sotos · Accepted Answer · 2020-07-15 13:58:27Z

320

df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

Advantage of this approach: when you have a long list of columns and want to change only a few names, you name just those columns and the rest stay as they are. This is also very useful when joining tables with duplicate column names.
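When the handful of renames is easier to express as a dict, the same "change a few, keep the rest" effect can be had with a single toDF call. This is a rough sketch, not the answer's own method: rename_some is a hypothetical helper, and unlike withColumnRenamed (which silently ignores a missing column) it raises when a key is not an existing column.

```python
def rename_some(df, mapping):
    """Rename only the columns listed in mapping; leave the others untouched."""
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        # Design choice of this sketch: fail loudly on a misspelled column name.
        raise KeyError("columns not found: {}".format(missing))
    # Columns absent from mapping fall back to their current name.
    return df.toDF(*[mapping.get(c, c) for c in df.columns])
```

Usage would look like df = rename_some(df, {"colName": "newColName", "colName2": "newColName2"}).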

edited Jul 15, 2020 at 13:58

Sotos


answered Mar 30, 2016 at 7:25

Pankaj Kumar


5 Comments

Quetzalcoatl Over a year ago

is there a variant of this solution that leaves all other columns unchanged? with this method, and others, only the explicitly named columns remained (all others removed)

mnis.p Over a year ago

+1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed.

user989762 Over a year ago

@Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names

Quetzalcoatl Over a year ago

@user989762: agreed; my initial understanding was incorrect on this one...!

Powers Over a year ago

This is great for renaming a few columns. See my answer for a solution that can programmatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run withColumnRenamed (running withColumnRenamed that many times would also be inefficient, as explained here).

Petter Friberg · Accepted Answer · 2017-06-06 05:56:13Z

128

If you want to change all columns names, try df.toDF(*cols)
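Because toDF matches names purely by position, a small wrapper that checks the list before applying it can catch mistakes early. The sketch below is illustrative only: set_column_names is a hypothetical name, and the duplicate check is an optional design choice rather than something toDF itself requires.

```python
def set_column_names(df, new_names):
    """Positional rename: new_names[i] replaces df.columns[i]."""
    new_names = list(new_names)
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected {} names, got {}".format(len(df.columns), len(new_names))
        )
    if len(set(new_names)) != len(new_names):
        # Optional guard: duplicate names make later column references ambiguous.
        raise ValueError("new_names contains duplicates")
    return df.toDF(*new_names)
```

It would be called as df = set_column_names(df, new_column_name_list), which mirrors the pandas assignment df.columns = new_column_name_list.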

edited Jun 6, 2017 at 5:56

Petter Friberg


answered Jun 6, 2017 at 5:52

user8117731


4 Comments

Quetzalcoatl Over a year ago

this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution.

Nic Scozzaro Over a year ago

For me I was getting the header names from a pandas dataframe, so I just used df = df.toDF(*my_pandas_df.columns)

rbatt Over a year ago

This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having cols be the new column names, and just assuming the order of names in cols corresponds to the column order of the dataframe?

Krunal Patel Over a year ago

@rbatt Using df.select in combination with pyspark.sql.functions col-method is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: stackoverflow.com/a/62728542/8551891

pbahr · Accepted Answer · 2018-04-23 14:50:38Z

83

In case you would like to apply a simple transformation on all column names, this code does the trick: (I am replacing all spaces with underscore)

new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.
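A name transformation can map two different columns to the same result (for example "a b" and "a_b"), so it can be worth checking for collisions before handing the list to toDF. A minimal sketch with hypothetical helpers clean_name and clean_names; the exact cleaning rule (lowercase, non-alphanumeric runs become one underscore) is an assumption chosen for illustration.

```python
import re


def clean_name(name):
    """Lowercase a name and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def clean_names(names):
    """Clean every name and refuse to return a list with collisions."""
    cleaned = [clean_name(n) for n in names]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("cleaning produced duplicate names: {}".format(cleaned))
    return cleaned


print(clean_names(["First Name", "Total ($)", "id"]))
# ['first_name', 'total', 'id']
```

The DataFrame would then be renamed with df = df.toDF(*clean_names(df.columns)).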

edited Apr 23, 2018 at 14:50

answered Apr 13, 2018 at 15:17

pbahr


1 Comment

Powers Over a year ago

This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1

Ratul Ghosh · Accepted Answer · 2017-01-15 15:22:33Z

21

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])

answered Jan 15, 2017 at 15:22

Ratul Ghosh


Grant Shannon · Accepted Answer · 2018-12-07 15:00:20Z

this is the approach that I used:

create pyspark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

create dataframe:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

view df with column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

create a list with new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

view df with new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+

Vedom · Accepted Answer · 2020-05-29 17:58:34Z

15

I made an easy to use function to rename multiple columns for a pyspark dataframe, in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful: both lists must be the same length, because zip silently stops at the end of the shorter one.
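A variant of the function above can make the length requirement explicit and, on newer Spark, hand the whole mapping over in one call. This is a sketch under assumptions: rename_columns is a hypothetical name, and it relies on DataFrame.withColumnsRenamed, which PySpark provides from version 3.4; on older versions it falls back to the loop.

```python
def rename_columns(df, old_columns, new_columns):
    """Rename old_columns[i] to new_columns[i], refusing mismatched lists."""
    if len(old_columns) != len(new_columns):
        raise ValueError("old_columns and new_columns must have the same length")
    mapping = dict(zip(old_columns, new_columns))
    if hasattr(df, "withColumnsRenamed"):
        # Spark 3.4+: one call that takes the whole {old: new} dict.
        return df.withColumnsRenamed(mapping)
    # Older Spark: one withColumnRenamed call per pair.
    for old_col, new_col in mapping.items():
        df = df.withColumnRenamed(old_col, new_col)
    return df
```

The call site stays the same as before: df_renamed = rename_columns(df, old_columns, new_columns).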

edited May 29, 2020 at 17:58

Vedom


answered Mar 22, 2019 at 11:57

Manrique


1 Comment

Darth Egregious Over a year ago

Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because old_columns would be the same as df.columns.

h4z3 · Accepted Answer · 2022-03-24 09:31:13Z

13

Method 1:

df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2: If you want to do some computation and rename the new values

import pyspark.sql.functions as F
df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")

edited Mar 24, 2022 at 9:31

h4z3


answered Dec 15, 2020 at 13:45

Gourav Bansal


2 Comments

astentx Over a year ago

There was a lot of similar answers so no need to post another one duplicate.

Sheldore Over a year ago

The first argument in withColumnRenamed is the old column name. Your Method 1 is wrong

scottlittle · Accepted Answer · 2018-06-20 14:24:12Z

12

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')

answered Jun 20, 2018 at 14:24

scottlittle


Clock Slave · Accepted Answer · 2019-10-11 10:19:30Z

7

You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list

To rename all columns

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename some columns

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])

answered Oct 11, 2019 at 10:19

Clock Slave


2 Comments

John Haberstroh Over a year ago

I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the toDF function and a list comprehension that implements whatever logic is desired was much more succinct. for example, def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF(*[c + suffix for c in spark_df.columns])

Sheldore Over a year ago

Since mapping is a dictionary, why can't you simply use mapping[c] instead of mapping.get(c, c)?

mike · Accepted Answer · 2022-10-05 05:11:15Z

7

we can use col.alias for renaming the column:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()

edited Oct 5, 2022 at 5:11

answered Jan 31, 2018 at 14:33

mike


Neeraj Bhadani · Accepted Answer · 2020-05-31 08:40:58Z

We can use various approaches to rename the column name.

First, let's create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)], 
                                  ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col_3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.

lfvv · Accepted Answer · 2021-09-01 15:30:43Z

5

A way that you can use 'alias' to change the column name:

col('my_column').alias('new_name')

Another way that you can use 'alias' (possibly not mentioned):

df.my_column.alias('new_name')

answered Sep 1, 2021 at 15:30

lfvv


Haha TTpro · Accepted Answer · 2020-10-16 07:14:35Z

4

You can use a for loop with zip to pair each old column name with its new name from two lists.

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)

answered Oct 16, 2020 at 7:14

Haha TTpro


Michael H. · Accepted Answer · 2020-11-03 11:51:44Z

4

I like to use a dict to rename the df.

rename = {'old1': 'new1', 'old2': 'new2'}
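for old_name in df.schema.names:
    # Columns missing from the dict keep their current name instead of raising KeyError.
    df = df.withColumnRenamed(old_name, rename.get(old_name, old_name))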

answered Nov 3, 2020 at 11:51

Michael H.


prashangrg · Accepted Answer · 2023-12-06 15:10:31Z

4

The simplest solution, where columns is a list of (old_name, new_name) pairs, is:

for col, new_col in columns:
    df = df.withColumnRenamed(col, new_col)

answered Dec 6, 2023 at 15:10

prashangrg


dcio · Accepted Answer · 2020-10-10 13:17:57Z

2

There are multiple approaches you can use:

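from pyspark.sql.functions import col

df1 = df.withColumn("new_column", col("old_column")).drop("old_column")
df1 = df.withColumn("new_column", col("old_column"))  # copies the column; old_column is kept as well
df1 = df.select(col("old_column").alias("new_column"))  # keeps only the renamed column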

edited Oct 10, 2020 at 13:17

dcio


answered Oct 10, 2020 at 6:14

pankajs


ZygD · Accepted Answer · 2022-09-06 14:20:48Z

2

List comprehension + f-string:

df = df.toDF(*[f'n_{c}' for c in df.columns])

Simple list comprehension:

df = df.toDF(*[c.lower() for c in df.columns])

answered Sep 6, 2022 at 14:20

ZygD


ganeiy · Accepted Answer · 2017-06-27 14:42:33Z

1

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()

answered Jun 27, 2017 at 14:42

ganeiy


thedataengineer · Accepted Answer · 2021-03-28 04:50:24Z


from pyspark.sql.types import StructType,StructField, StringType, IntegerType

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = StructType([ \
    StructField("employee_name",StringType(),True), \
    StructField("department",StringType(),True), \
    StructField("state",StringType(),True), \
    StructField("salary", IntegerType(), True), \
    StructField("age", IntegerType(), True), \
    StructField("bonus", IntegerType(), True) \
  ])

 
OurData = spark.createDataFrame(data=CreatingDataFrame,schema=schema)

OurData.show()

# COMMAND ----------

GrouppedBonusData=OurData.groupBy("department").sum("bonus")


# COMMAND ----------

GrouppedBonusData.show()


# COMMAND ----------

GrouppedBonusData.printSchema()

# COMMAND ----------

from pyspark.sql.functions import col

BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

# COMMAND ----------

GrouppedBonusData.groupBy("department").count().show()

# COMMAND ----------

GrouppedSalaryData=OurData.groupBy("department").sum("salary")

# COMMAND ----------

GrouppedSalaryData.show()

# COMMAND ----------

from pyspark.sql.functions import col

SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()

Dicer · Accepted Answer · 2022-01-30 04:50:48Z

Try the following method. It can be reused to rename the columns of multiple files.

Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/

from pyspark.sql.functions import col

df_initial = spark.read.load('com.databricks.spark.csv')

rename_dict = {
    'Alberto': 'Name',
    'Dakota': 'askdaosdka'
}

# Columns that are not keys of rename_dict keep their names.
df_renamed = df_initial \
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])

def renameColumns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])

df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
    .transform(renameColumns)

sargupta · Accepted Answer · 2022-04-02 09:42:53Z

1

The simplest solution is using withColumnRenamed:

renamed_df = df.withColumnRenamed('name_1', 'New_name_1').withColumnRenamed('name_2', 'New_name_2')
renamed_df.show()

And if you would like to do this like we do with Pandas, you can use toDF:

Create an ordered list of the new column names and pass it to toDF:

df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()

answered Apr 2, 2022 at 9:42

sargupta


Rayanaay · Accepted Answer · 2022-08-25 08:46:03Z

1

This is an easy way to rename multiple columns with a loop:

cols_to_rename = ["col1","col2","col3"]

for col in cols_to_rename:
  df = df.withColumnRenamed(col,"new_{}".format(col))

answered Aug 25, 2022 at 8:46

Rayanaay


John Haberstroh · Accepted Answer · 2022-12-19 23:30:36Z

1

The closest statement to df.columns = new_column_name_list is:

import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new) 
                 for (name_old, name_new) 
                 in zip(df.columns, new_column_name_list)])

This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:

import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new) 
                  for (name_old, name_new) 
                  in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)

answered Dec 19, 2022 at 23:30

John Haberstroh


KaranSingh · Accepted Answer · 2023-05-15 13:40:32Z

1

To apply a generic function to the column names of a Spark dataframe and rename the columns accordingly, you can use the quinn library. See the example code:

import quinn
def lower_case(col):
  return col.lower()

df_ = quinn.with_columns_renamed(lower_case)(df)

lower_case is the function name and df is the initial spark dataframe

If you get an error importing the quinn library, install it first:

%pip install quinn

answered May 15, 2023 at 13:40

KaranSingh
