
How do I fill the nulls in this category column with the single distinct non-null value that exists for each id?

+---+---------+----------+
| id| category|      Date|
+---+---------+----------+
| A1|     Null|2010-01-02|
| A1|     Null|2010-01-03|
| A1|    Nixon|2010-01-04|
| A1|     Null|2010-01-05|
| A9|     Null|2010-05-02|
| A9|  Leonard|2010-05-03|
| A9|     Null|2010-05-04|
| A9|     Null|2010-05-05|
+---+---------+----------+

Desired DataFrame:

+---+---------+----------+
| id| category|      Date|
+---+---------+----------+
| A1|    Nixon|2010-01-02|
| A1|    Nixon|2010-01-03|
| A1|    Nixon|2010-01-04|
| A1|    Nixon|2010-01-05|
| A9|  Leonard|2010-05-02|
| A9|  Leonard|2010-05-03|
| A9|  Leonard|2010-05-04|
| A9|  Leonard|2010-05-05|
+---+---------+----------+
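For reference, the input above can be rebuilt with something like this (a minimal sketch; string types for all three columns and None for the nulls are assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above, with None where the table shows Null
data = [
    ("A1", None,      "2010-01-02"),
    ("A1", None,      "2010-01-03"),
    ("A1", "Nixon",   "2010-01-04"),
    ("A1", None,      "2010-01-05"),
    ("A9", None,      "2010-05-02"),
    ("A9", "Leonard", "2010-05-03"),
    ("A9", None,      "2010-05-04"),
    ("A9", None,      "2010-05-05"),
]
df = spark.createDataFrame(data, ["id", "category", "Date"])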

I tried:

w = Window().partitionBy("ID").orderBy("Date")
df = df.withColumn("category", F.when(col("category").isNull(), col("category")\
.distinct().over(w))\
.otherwise(col("category")))

I also tried:

df = df.fillna({'category': col('category').distinct()})

I have also tried:

df = df.withColumn('category', when(df.category.isNull(), df.category.distinct()).otherwise(df.category))
  • df = df.groupby(['category']).fillna(method='ffill') and then do a bfill Commented Oct 9, 2020 at 1:10
  • This is Pyspark, not Pandas Commented Oct 9, 2020 at 1:12
  • if there is only one distinct value for each ID, then just: df_new = df.withColumn('category', F.first('category',True).over(Window.partitionBy('id'))) Commented Oct 9, 2020 at 1:14
  • @jxc, thanks, but this will Null out all of my data Commented Oct 9, 2020 at 1:17
  • @Starbucks, the function first with the 2nd argument ignorenulls=True should pick the first non-null value from the same partition. If there are any non-null values, it will not null everything out. spark.apache.org/docs/latest/api/python/… Commented Oct 9, 2020 at 1:21

1 Answer


You can use first() with the ignorenulls parameter set to True.
Also apply rowsBetween(-sys.maxsize, sys.maxsize) to your window so the frame spans the whole partition.

from pyspark.sql import functions as F
from pyspark.sql.window import Window
import sys

w = Window().partitionBy("id").orderBy("Date")

# first() with ignorenulls=True returns the first non-null category in the frame;
# the unbounded rowsBetween makes the frame cover the whole partition
df.withColumn("category", F.first('category', True).over(w.rowsBetween(-sys.maxsize, sys.maxsize)))\
        .orderBy("id", "Date").show()

+---+--------+----------+
| id|category|      Date|
+---+--------+----------+
| A1|   Nixon|2010-01-02|
| A1|   Nixon|2010-01-03|
| A1|   Nixon|2010-01-04|
| A1|   Nixon|2010-01-05|
| A9| Leonard|2010-05-02|
| A9| Leonard|2010-05-03|
| A9| Leonard|2010-05-04|
| A9| Leonard|2010-05-05|
+---+--------+----------+
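
If you'd rather avoid sys.maxsize, an equivalent frame can be written with Spark's unbounded window constants; this is just a variation on the same approach, not a different technique:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The frame spans the whole id partition, so first() with ignorenulls=True
# picks the one non-null category in each group
w = Window.partitionBy("id").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = df.withColumn("category", F.first("category", ignorenulls=True).over(w))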