
I have a PySpark DataFrame with many columns (around 320).

I need to find the keyword baz in column A. Wherever baz is found, the existing values in all of the columns listed in columns_for_replacement should be replaced with None:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

I am trying to modify the code below to do this:

for i in columns_for_replacement:
    df = df.withColumn(i, when(col(i) == 'baz', None).otherwise(col(i)))

The above code only checks each column against the keyword individually, which is not what I need.

Base dataframe:

A   B   C   D   E   F   G   H   I   J   
baz abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
def abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
map abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
noo abc abc abc abc abc abc abc abc abc

Expected dataframe:

A   B   C   D   E   F   G   H   I   J   
baz                                 abc
baz                                 abc
def abc abc abc abc abc abc abc abc abc
baz                                 abc
map abc abc abc abc abc abc abc abc abc
baz                                 abc
noo abc abc abc abc abc abc abc abc abc
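For reproducibility, the base DataFrame can be built with a minimal snippet like this (assuming a local SparkSession; the column names and cell values are taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns A..J; column A carries the keyword, the rest hold placeholder data.
cols = list("ABCDEFGHIJ")
rows = [(a, *["abc"] * 9) for a in ["baz", "baz", "def", "baz", "map", "baz", "noo"]]
df = spark.createDataFrame(rows, cols)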

2 Answers


I think you meant to use column A in the when condition, not column i:

from pyspark.sql.functions import col, lit, when

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

for i in columns_for_replacement:
    df = df.withColumn(i, when(col("A") == 'baz', lit(None)).otherwise(col(i)))

Another way to do this would be with a single select:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

df2 = df.select([
    when(col("A") == 'baz', lit(None)).otherwise(col(c)).alias(c)
    if c in columns_for_replacement
    else col(c)
    for c in df.columns
])

1 Comment

I bet the 2nd approach is more efficient too, based on the note in the docs for DataFrame.withColumn(): "This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with the multiple columns at once."
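Following up on that note: since Spark 3.3 there is also DataFrame.withColumns, which applies a whole dict of column expressions in a single projection. A minimal sketch, assuming Spark 3.3+ (df3 is just an illustrative name):

from pyspark.sql.functions import col, lit, when

# One projection for all replacement columns instead of one per withColumn call.
df3 = df.withColumns({
    c: when(col("A") == "baz", lit(None)).otherwise(col(c))
    for c in columns_for_replacement
})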

You need to check column A:

from pyspark.sql import functions as F

df1 = df.select(
    F.col("A"),
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    F.col("J")
)

df1.show()

#+---+----+----+----+----+----+----+----+----+---+
#|  A|   B|   C|   D|   E|   F|   G|   H|   I|  J|
#+---+----+----+----+----+----+----+----+----+---+
#|baz|null|null|null|null|null|null|null|null|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|def| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|map| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|noo| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#+---+----+----+----+----+----+----+----+----+---+

Or, more generically, to avoid having to select each column that is not in the replacement list:

df1 = df.select(
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    *[F.col(c) for c in df.columns if c not in columns_for_replacement]
).select(df.columns)

Note that the aliased expressions already carry the original column names, so the trailing select(df.columns) simply restores the original column order; renaming positionally with toDF(*df.columns) would mislabel the reordered columns.
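If this pattern comes up repeatedly, it can be wrapped in a small helper. A sketch under the same assumptions (the function name null_out_columns and its parameters are illustrative, not part of either answer):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def null_out_columns(df: DataFrame, key_col: str, keyword: str, targets: list) -> DataFrame:
    # Null out every column in `targets` on rows where `key_col` equals `keyword`,
    # keeping the original column order in a single projection.
    return df.select([
        F.when(F.col(key_col) == keyword, F.lit(None)).otherwise(F.col(c)).alias(c)
        if c in targets else F.col(c)
        for c in df.columns
    ])

df1 = null_out_columns(df, "A", "baz", columns_for_replacement)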

