
How do I fill the nulls in this category column with the single distinct non-null value that exists for each id?

+---+---------+----------+
| id| category|      Date|
+---+---------+----------+
| A1|     Null|2010-01-02|
| A1|     Null|2010-01-03|
| A1|    Nixon|2010-01-04|
| A1|     Null|2010-01-05|
| A9|     Null|2010-05-02|
| A9|  Leonard|2010-05-03|
| A9|     Null|2010-05-04|
| A9|     Null|2010-05-05|
+---+---------+----------+

Desired DataFrame:

+---+---------+----------+
| id| category|      Date|
+---+---------+----------+
| A1|    Nixon|2010-01-02|
| A1|    Nixon|2010-01-03|
| A1|    Nixon|2010-01-04|
| A1|    Nixon|2010-01-05|
| A9|  Leonard|2010-05-02|
| A9|  Leonard|2010-05-03|
| A9|  Leonard|2010-05-04|
| A9|  Leonard|2010-05-05|
+---+---------+----------+
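For reference, the input above can be rebuilt with something like this (a minimal sketch; string types for all three columns and None for the nulls are assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above, with None where the table shows Null
data = [
    ("A1", None,      "2010-01-02"),
    ("A1", None,      "2010-01-03"),
    ("A1", "Nixon",   "2010-01-04"),
    ("A1", None,      "2010-01-05"),
    ("A9", None,      "2010-05-02"),
    ("A9", "Leonard", "2010-05-03"),
    ("A9", None,      "2010-05-04"),
    ("A9", None,      "2010-05-05"),
]
df = spark.createDataFrame(data, ["id", "category", "Date"])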

I tried:

w = Window().partitionBy("ID").orderBy("Date")
df = df.withColumn("category", F.when(col("category").isNull(), col("category")\
.distinct().over(w))\
.otherwise(col("category")))

I also tried:

df = df.fillna({'category': col('category').distinct()})

I have also tried:

df = df.withColumn('category', when(df.category.isNull(), df.category.distinct()).otherwise(df.category))
  • df = df.groupby(['category']).fillna(method='ffill') and then do a bfill Commented Oct 9, 2020 at 1:10
  • This is Pyspark, not Pandas Commented Oct 9, 2020 at 1:12
  • if there is only one distinct value for each ID, then just: df_new = df.withColumn('category', F.first('category',True).over(Window.partitionBy('id'))) Commented Oct 9, 2020 at 1:14
  • @jxc, thanks, but this will Null out all of my data Commented Oct 9, 2020 at 1:17
  • @Starbucks, the function first with the 2nd argument ignorenulls=True should pick the first non-null value from the same partition. If there are any non-null values, it will not null everything out. spark.apache.org/docs/latest/api/python/… Commented Oct 9, 2020 at 1:21

1 Answer


You can use first() with the ignorenulls parameter set to True.
Also apply rowsBetween(-sys.maxsize, sys.maxsize) to your window so the frame spans the whole partition.

from pyspark.sql import functions as F
from pyspark.sql.window import Window
import sys

w = Window().partitionBy("id").orderBy("Date")

# first() with ignorenulls=True returns the first non-null category in the frame;
# the unbounded rowsBetween makes the frame cover the whole partition
df.withColumn("category", F.first('category', True).over(w.rowsBetween(-sys.maxsize, sys.maxsize)))\
        .orderBy("id", "Date").show()

+---+--------+----------+
| id|category|      Date|
+---+--------+----------+
| A1|   Nixon|2010-01-02|
| A1|   Nixon|2010-01-03|
| A1|   Nixon|2010-01-04|
| A1|   Nixon|2010-01-05|
| A9| Leonard|2010-05-02|
| A9| Leonard|2010-05-03|
| A9| Leonard|2010-05-04|
| A9| Leonard|2010-05-05|
+---+--------+----------+
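
If you'd rather avoid sys.maxsize, an equivalent frame can be written with Spark's unbounded window constants; this is just a variation on the same approach, not a different technique:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The frame spans the whole id partition, so first() with ignorenulls=True
# picks the one non-null category in each group
w = Window.partitionBy("id").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = df.withColumn("category", F.first("category", ignorenulls=True).over(w))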