How to add suffix and prefix to all columns in python/pyspark dataframe

Question

I have a PySpark DataFrame with more than 100 columns. I would like to add a backtick (`) at the start and at the end of every column name.

For example:

The column name is testing user. I want `testing user`.

Is there a way to do this in PySpark/Python? The code should return a DataFrame.

knanne · Accepted Answer · 2019-09-20 12:57:52Z

41

Use a list comprehension in Python.

from pyspark.sql import functions as F

df = ...

df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])

This method also lets you put custom Python logic inside alias(), for example: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
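The conditional alias could be sketched roughly as follows. The helper names new_name and rename_selected and the parameter cols_to_change are hypothetical, chosen only for illustration; the naming rule is kept in a plain function so it can be checked without a Spark session.

```python
def new_name(name, cols_to_change, prefix="prefix_", suffix="_suffix"):
    """Return the renamed column, or the original name if it is not selected."""
    return prefix + name + suffix if name in cols_to_change else name


def rename_selected(df, cols_to_change, prefix="prefix_", suffix="_suffix"):
    """Select every column of a PySpark DataFrame, aliasing only the chosen ones."""
    # Imported lazily so new_name() can be tried without PySpark installed.
    from pyspark.sql import functions as F

    wanted = set(cols_to_change)
    return df.select(
        [F.col(c).alias(new_name(c, wanted, prefix, suffix)) for c in df.columns]
    )


# Pure-Python check of the naming rule (no Spark session needed).
print([new_name(c, {"age"}) for c in ["name", "age"]])
# ['name', 'prefix_age_suffix']
```

Because every column still goes through select, the column order of the original DataFrame is preserved.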

answered Sep 20, 2019 at 12:57

knanne


appleboy · Accepted Answer · 2020-07-04 11:33:05Z

8

To add prefix or suffix:

Use df.columns to get the list of column names ([col_1, col_2...]). Here df is the DataFrame whose columns we want to prefix or suffix.

df.columns

Iterate through that list and build another list of aliased columns that can be used inside a select expression.

from pyspark.sql.functions import col

select_list = [col(col_name).alias("prefix_" + col_name)  for col_name in df.columns]

When passing the list to select, you can unpack it with an asterisk (*); select also accepts the list itself. The result can be assigned back to the same variable or to a different one.

df.select(*select_list).show()
df = df.select(*select_list)

df.columns will now return the list of new (aliased) column names.

edited Jul 4, 2020 at 11:33

answered Jul 4, 2020 at 11:27

appleboy


1 Comment

Krunal Patel Over a year ago

Thanks for the steps-breakdown. Using df.select in combination with pyspark.sql.functions col-method is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Below is the sample select_list content: [Column<b'XYZ AS prefix_XYZ'>, Column<b'ABC_ID AS prefix_ABC_ID'>]

Patrick ML · Accepted Answer · 2018-11-07 16:55:19Z

5

If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().

As an example, you might like:

def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf

You can replace sdf.columns with any subset of column names if only some columns should be renamed.
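For renaming only some of the columns, or for renaming them in a single call rather than a loop, one possible sketch is shown here. It assumes Spark 3.4 or later, where DataFrame.withColumnsRenamed accepts a dict of old-to-new names; the helper names prefix_mapping and add_prefix_batch and the only parameter are hypothetical.

```python
def prefix_mapping(columns, prefix, only=None):
    """Map each old column name to its prefixed name; skip names not in `only`."""
    selected = set(columns) if only is None else set(only)
    return {c: prefix + c for c in columns if c in selected}


def add_prefix_batch(sdf, prefix, only=None):
    """Rename in one call. Assumes Spark >= 3.4 (DataFrame.withColumnsRenamed)."""
    return sdf.withColumnsRenamed(prefix_mapping(sdf.columns, prefix, only))


# The mapping itself is plain Python and can be inspected without Spark.
print(prefix_mapping(["id", "name"], "src_", only=["name"]))
# {'name': 'src_name'}
```

On older Spark versions the same mapping can be fed to the withColumnRenamed loop above, one pair at a time.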

edited Nov 7, 2018 at 16:55

answered Nov 7, 2018 at 16:44

Patrick ML


Pushkr · Accepted Answer · 2017-04-01 19:04:38Z

3

You can use the DataFrame's withColumnRenamed method to create a new DataFrame (the method belongs to the DataFrame itself, not to df.na):

df.withColumnRenamed('testing user', '`testing user`')

Edit: suppose you have a list of column names; then you can do:

old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)

Output:

DataFrame[`First`: string, `Last`: string, `Age`: string]
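The same renaming can probably be done without going through the RDD, since DataFrame.toDF(*cols) returns a new DataFrame with the given names in order. A minimal sketch, where the helper names backticked and wrap_columns_in_backticks are hypothetical:

```python
def backticked(columns):
    """Wrap each column name in backticks."""
    return ["`" + c + "`" for c in columns]


def wrap_columns_in_backticks(df):
    """Return a new PySpark DataFrame whose column names are wrapped in backticks."""
    # toDF expects exactly one new name per existing column, in the same order.
    return df.toDF(*backticked(df.columns))


# The name list is plain Python and can be checked without Spark.
print(backticked("First Last Age".split()))
# ['`First`', '`Last`', '`Age`']
```

Building the names from df.columns, instead of a hand-written string, keeps the new list the same length as the existing schema.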

edited Apr 1, 2017 at 19:04

answered Apr 1, 2017 at 18:01

Pushkr


2 Comments

Pushkr Over a year ago

Updated my answer

Pushkr Over a year ago

If you are just trying to export data from MySQL to Hive, you might as well use Sqoop; unless you are performing specialized processing on the data, you don't have to go through Spark.

Dwindwin · Accepted Answer · 2019-06-03 14:02:56Z

1

I had a DataFrame that I duplicated twice and then joined together. Since both had the same column names, I used:

from functools import reduce  # reduce is not a builtin in Python 3

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)

Every column in my DataFrame then had the '_prec' suffix, which allowed me to do sweet stuff.

answered Jun 3, 2019 at 14:02

Dwindwin


2 Comments

Ralf Over a year ago

Could you explain in more detail how this answers the question?

Dwindwin Over a year ago

The question asked was how to add a suffix or a prefix to all the columns of a DataFrame. Here I added a suffix, but you can do both by simply changing the second parameter of withColumnRenamed. In the asker's case it would be "`" + list(df.schema.names)[idx] + "`"
