PySpark replace null in column with value in other column

Question

I want to replace null values in one column with the values in an adjacent column. For example, if I have

A,B
0,1
2,null
3,null
4,2

I want it to be:

A,B
0,1
2,2
3,3
4,2

Tried with

df.na.fill(df.A,"B")

But it didn't work; it says the value should be a float, int, long, string, or dict.

Any ideas?

Abhishek Gupta · Accepted Answer · 2021-03-22 17:29:44Z

74

We can use coalesce, which returns the first non-null value among its arguments:

from pyspark.sql.functions import coalesce
    
df.withColumn("B",coalesce(df.B,df.A))

edited Mar 22, 2021 at 17:29

Abhishek Gupta

answered Mar 24, 2017 at 4:33

Luis Leal

1 Comment

user8276908 Over a year ago

This solution is missing a from pyspark.sql.functions import coalesce

Rags · Accepted Answer · 2017-03-24 04:44:05Z

17

Another answer.

If df1 below is your DataFrame:

rd1 = sc.parallelize([(0,1), (2,None), (3,None), (4,2)])
df1 = rd1.toDF(['A', 'B'])

from pyspark.sql.functions import when
df1.select('A',
           when( df1.B.isNull(), df1.A).otherwise(df1.B).alias('B')
          )\
   .show()

answered Mar 24, 2017 at 4:44

Rags

Pushkr · Accepted Answer · 2017-03-24 03:20:45Z

3

from pyspark.sql import Row
df.rdd.map(lambda row: row if row[1] is not None else Row(A=row[0], B=row[0])).toDF().show()

answered Mar 24, 2017 at 3:20

Pushkr

1 Comment

Luis Leal Over a year ago

Thank you. In the end I used coalesce: df.withColumn("B",coalesce(df.B,df.A)). But your answer is helpful in case anybody else tries this.

Tomasz Bartkowiak · Accepted Answer · 2021-11-16 12:00:23Z

Note: coalesce will not replace NaN values, only nulls:

import pandas as pd
import pyspark.sql.functions as F

>>> cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
>>> cDf.show()
+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|null|
|null|   2|
+----+----+

>>> cDf.select(F.coalesce(cDf["a"], cDf["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|          null|
|             1|
|             2|
+--------------+

Let's now create a pandas.DataFrame with None entries, convert it into spark.DataFrame and use coalesce again:

>>> cDf_from_pd = spark.createDataFrame(pd.DataFrame({'a': [None, 1, None], 'b': [None, None, 2]}))
>>> cDf_from_pd.show()
+---+---+
|  a|  b|
+---+---+
|NaN|NaN|
|1.0|NaN|
|NaN|2.0|
+---+---+

>>> cDf_from_pd.select(F.coalesce(cDf_from_pd["a"], cDf_from_pd["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|           NaN|
|           1.0|
|           NaN|
+--------------+

In that case, you first need to call replace on your DataFrame to convert the NaNs to nulls.
