
I have a PySpark DataFrame with many columns (around 320).

I need to find the keyword baz in column A. Wherever baz is found, the existing values in all of the columns listed in columns_for_replacement should be replaced with None:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

I am trying to modify the code below to do this:

for i in columns_for_replacement:
    df = df.withColumn(i, when(col(i) == 'baz', None).otherwise(col(i)))

The above code only checks each column against the keyword individually, which is not what I need.

Base dataframe:

A   B   C   D   E   F   G   H   I   J   
baz abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
def abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
map abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
noo abc abc abc abc abc abc abc abc abc

Expected dataframe:

A   B   C   D   E   F   G   H   I   J   
baz                                 abc
baz                                 abc
def abc abc abc abc abc abc abc abc abc
baz                                 abc
map abc abc abc abc abc abc abc abc abc
baz                                 abc
noo abc abc abc abc abc abc abc abc abc
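For reproducibility, the base DataFrame can be built with a minimal snippet like this (assuming a local SparkSession; the column names and cell values are taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns A..J; column A carries the keyword, the rest hold placeholder data.
cols = list("ABCDEFGHIJ")
rows = [(a, *["abc"] * 9) for a in ["baz", "baz", "def", "baz", "map", "baz", "noo"]]
df = spark.createDataFrame(rows, cols)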

2 Answers


I think you meant to use column A in the when condition, not column i:

from pyspark.sql.functions import col, lit, when

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

for i in columns_for_replacement:
    df = df.withColumn(i, when(col("A") == 'baz', lit(None)).otherwise(col(i)))

Another way to do this would be with a single select:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

df2 = df.select([
    when(col("A") == 'baz', lit(None)).otherwise(col(c)).alias(c)
    if c in columns_for_replacement
    else col(c)
    for c in df.columns
])

1 Comment

I bet the 2nd approach is more efficient too, based on the note in the docs for DataFrame.withColumn(): "This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with the multiple columns at once."
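Following up on that note: since Spark 3.3 there is also DataFrame.withColumns, which applies a whole dict of column expressions in a single projection. A minimal sketch, assuming Spark 3.3+ (df3 is just an illustrative name):

from pyspark.sql.functions import col, lit, when

# One projection for all replacement columns instead of one per withColumn call.
df3 = df.withColumns({
    c: when(col("A") == "baz", lit(None)).otherwise(col(c))
    for c in columns_for_replacement
})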

You need to check column A:

from pyspark.sql import functions as F

df1 = df.select(
    F.col("A"),
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    F.col("J")
)

df1.show()

#+---+----+----+----+----+----+----+----+----+---+
#|  A|   B|   C|   D|   E|   F|   G|   H|   I|  J|
#+---+----+----+----+----+----+----+----+----+---+
#|baz|null|null|null|null|null|null|null|null|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|def| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|map| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|noo| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#+---+----+----+----+----+----+----+----+----+---+

Or, more generically, to avoid having to select each column that is not in the replacement list:

df1 = df.select(
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    *[F.col(c) for c in df.columns if c not in columns_for_replacement]
).select(df.columns)

Note that the aliased expressions already carry the original column names, so the trailing select(df.columns) simply restores the original column order; renaming positionally with toDF(*df.columns) would mislabel the reordered columns.
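If this pattern comes up repeatedly, it can be wrapped in a small helper. A sketch under the same assumptions (the function name null_out_columns and its parameters are illustrative, not part of either answer):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def null_out_columns(df: DataFrame, key_col: str, keyword: str, targets: list) -> DataFrame:
    # Null out every column in `targets` on rows where `key_col` equals `keyword`,
    # keeping the original column order in a single projection.
    return df.select([
        F.when(F.col(key_col) == keyword, F.lit(None)).otherwise(F.col(c)).alias(c)
        if c in targets else F.col(c)
        for c in df.columns
    ])

df1 = null_out_columns(df, "A", "baz", columns_for_replacement)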

