
Suppose you have a Spark dataframe containing some null values, and you would like to fill the nulls in one column with the values from another column where they are present. In Python/Pandas you can use the fillna() function to do this quite nicely:

df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3']) 

How can this be done using Pyspark?

1 Answer

You need to use the coalesce function from pyspark.sql.functions:

from pyspark.sql.functions import coalesce, lit

cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# |   a|   b|
# +----+----+
# |null|null|
# |   1|null|
# |null|   2|
# +----+----+

cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# |          null|
# |             1|
# |             2|
# +--------------+

cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# |   a|   b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null|             0.0|
# |   1|null|             1.0|
# |null|   2|             0.0|
# +----+----+----------------+

You can also apply coalesce to multiple columns:

cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...

These examples are taken from the pyspark.sql API documentation.


2 Comments

Excellent. Worth noting that multiple columns can be passed for filling values cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
Just make sure the column values are "null" and not "empty" strings. I had this problem and had to explicitly convert "empty" values in one of the columns to "null" using df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
