
I have a dataframe (df) and within the dataframe I have a column user_id

df = sc.parallelize([(1, "not_set"),
                     (2, "user_001"),
                     (3, "user_002"),
                     (4, "n/a"),
                     (5, "N/A"),
                     (6, "userid_not_set"),
                     (7, "user_003"),
                     (8, "user_004")]).toDF(["key", "user_id"])

df:

+---+--------------+
|key|       user_id|
+---+--------------+
|  1|       not_set|
|  2|      user_001|
|  3|      user_002|
|  4|           n/a|
|  5|           N/A|
|  6|userid_not_set|
|  7|      user_003|
|  8|      user_004|
+---+--------------+

I would like to replace the following values with null: not_set, n/a, N/A, and userid_not_set.

It would be good if I could add any new values to a list and have them changed as well.

I am currently using a CASE statement within spark.sql to perform this and would like to change it to PySpark.
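
For context, a minimal sketch of what I have today (the temp view name df_view is just for illustration):

df.createOrReplaceTempView("df_view")
spark.sql("""
    SELECT key,
           CASE WHEN user_id IN ('not_set', 'n/a', 'N/A', 'userid_not_set') THEN NULL
                ELSE user_id
           END AS user_id
    FROM df_view
""").show()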


3 Answers


None inside the when() function corresponds to null. If you wish to fill in anything other than null, supply that value in its place.

from pyspark.sql.functions import col, when

df = df.withColumn(
    "user_id",
    when(
        col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'),
        None
    ).otherwise(col("user_id"))
)
df.show()
+---+--------+
|key| user_id|
+---+--------+
|  1|    null|
|  2|user_001|
|  3|user_002|
|  4|    null|
|  5|    null|
|  6|    null|
|  7|user_003|
|  8|user_004|
+---+--------+

1 Comment

I'd like to point out that when returns null if the condition fails and no otherwise is supplied. So in this case, the following is equivalent, but a little more succinct: df.withColumn("user_id", when(~col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'), col("user_id")))
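
Spelled out, with the values kept in a list so new ones can be added later (the name bad_ids is just illustrative), that succinct form is:

from pyspark.sql.functions import col, when

# when() with no otherwise() yields null whenever the condition is false,
# so negating the isin() test nulls out exactly the listed values
bad_ids = ['not_set', 'n/a', 'N/A', 'userid_not_set']
df = df.withColumn("user_id", when(~col("user_id").isin(bad_ids), col("user_id")))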

You can use the built-in when function, which is the equivalent of a CASE expression.

from pyspark.sql import functions as f
df.select(df.key, f.when(df.user_id.isin(['not_set', 'n/a', 'N/A', 'userid_not_set']), None).otherwise(df.user_id)).show()

The values can also be stored in a list and referenced:

val_list = ['not_set', 'n/a', 'N/A', 'userid_not_set']
df.select(df.key, f.when(df.user_id.isin(val_list), None).otherwise(df.user_id)).show()
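
One caveat: select leaves the computed column with a generated name (something like CASE WHEN ...). To keep the original column name, alias it; a small sketch of the same expression:

df.select(
    df.key,
    f.when(df.user_id.isin(val_list), None).otherwise(df.user_id).alias("user_id")
).show()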

1 Comment

I get the error "name 'null' is not defined"; if I change null to a string, it works.

Please find below a few approaches. I am assuming that all legitimate user IDs start with "user_". Please try the code below.

from pyspark.sql.functions import when, col, expr
df.withColumn(
    "user_id",
    when(col("user_id").startswith("user_"),col("user_id")).otherwise(None)
).show()

Another one, using a SQL CASE expression via expr:

cond = """case when user_id in ('not_set', 'n/a', 'N/A', 'userid_not_set') then null
                else user_id
            end"""

df.withColumn("ID", expr(cond)).show()

Another one, a CASE with a LIKE pattern:

cond = """case when user_id like 'user_%' then user_id
                else null
            end"""

df.withColumn("ID", expr(cond)).show()

Another one, using rlike (a regex match):

# rlike matches anywhere in the string, so anchor with ^ to require the prefix
df.withColumn(
    "user_id",
    when(col("user_id").rlike("^user_"), col("user_id")).otherwise(None)
).show()
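
As a side note, recent PySpark versions also allow DataFrame.replace to map values to None directly, which skips when() entirely (worth verifying on your Spark version):

df.replace(['not_set', 'n/a', 'N/A', 'userid_not_set'], None, 'user_id').show()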

