I am using Spark 2.4.0 and I am observing strange behavior while using the count function to aggregate:
from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+-----------+
Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function skips null values.
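To sanity-check this, I also compared it against counting a literal, which is never null, so I would expect it to count every row in each group (this is my own experiment, not something taken from the docs):

# count('col2') seems to skip nulls; count(lit(1)) should count every row
tst.groupby('col1').agg(
    F.count(F.col('col2')).alias('non_null_count'),
    F.count(F.lit(1)).alias('row_count')
).show()

Since every col1 group has two rows, I expect row_count to be 2 everywhere, while non_null_count matches the 2/1/0 above. More surprising for me is this: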
tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
|   1|                    2|
|   3|                    2|
|   2|                    2|
+----+---------------------+
Here I am totally confused. When I use isNull(), shouldn't it count only null values? Why is it counting all the values?
Is there anything I am missing?
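For reference, the only way I have found to actually get a per-group null count is to build it myself by casting the isNull() boolean to an integer and summing it (my own workaround in the meantime, so just a sketch):

# workaround: turn the isNull() boolean into 0/1 and sum it per group
tst.groupby('col1').agg(
    F.sum(F.col('col2').isNull().cast('int')).alias('null_count')
).show()

I expect this to give 0, 1 and 2 for col1 = 1, 2 and 3 respectively, which is what I actually want, but I would still like to understand why count behaves the way it does in the two examples above.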