I am using Spark 2.4.0 and I am observing strange behavior while using the count function to aggregate:
from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+-----------+
Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function skips null values.
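To sanity-check this, I also compared it against counting a literal, which is never null, so I would expect it to count every row in each group (this is my own experiment, not something taken from the docs):

# count('col2') seems to skip nulls; count(lit(1)) should count every row
tst.groupby('col1').agg(
    F.count(F.col('col2')).alias('non_null_count'),
    F.count(F.lit(1)).alias('row_count')
).show()

Since every col1 group has two rows, I expect row_count to be 2 everywhere, while non_null_count matches the 2/1/0 above. More surprising for me is this: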
tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
|   1|                    2|
|   3|                    2|
|   2|                    2|
+----+---------------------+
Here I am totally confused. When I use isNull(), shouldn't it count only null values? Why is it counting all the values?
Is there anything I am missing?
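For reference, the only way I have found to actually get a per-group null count is to build it myself by casting the isNull() boolean to an integer and summing it (my own workaround in the meantime, so just a sketch):

# workaround: turn the isNull() boolean into 0/1 and sum it per group
tst.groupby('col1').agg(
    F.sum(F.col('col2').isNull().cast('int')).alias('null_count')
).show()

I expect this to give 0, 1 and 2 for col1 = 1, 2 and 3 respectively, which is what I actually want, but I would still like to understand why count behaves the way it does in the two examples above.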