I am using Spark 2.4.0 and am observing strange behavior when using the count function to aggregate.

from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+

tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+-----------+

Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function does not count null values. Even more surprising to me is this:

tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
|   1|                    2|
|   3|                    2|
|   2|                    2|
+----+---------------------+

Here I am totally confused. When I use isNull(), shouldn't it count only null values? Why is it counting all the values?

Is there anything I am missing?

1 Answer

In both cases the results that you see are the expected ones.

Regarding the first example: according to the Scala source of count, there is a subtle difference between count(*) and count('col2'):

FUNC(*) - Returns the total number of retrieved rows, including rows containing null.
FUNC(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.

This explains why the null entries are not counted.

If you change the code to

tst.groupby('col1').agg(F.count('*')).show()

you get

+----+--------+
|col1|count(1)|
+----+--------+
|   1|       2|
|   3|       2|
|   2|       2|
+----+--------+
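
The two behaviours quoted above can be modelled in plain Python, without Spark. This is only an illustrative sketch: count_star and count_expr are hypothetical helper names, not Spark functions, and None stands in for SQL null.

```python
def count_star(values):
    """Model of count(*): every row counts, null or not."""
    return len(values)


def count_expr(values):
    """Model of count(expr): only rows where the value is not null count."""
    return sum(1 for v in values if v is not None)


# col2 values per col1 group, taken from the sample DataFrame in the question.
groups = {1: [2, 5], 2: [None, 3], 3: [None, None]}

star_counts = {key: count_star(vals) for key, vals in groups.items()}
expr_counts = {key: count_expr(vals) for key, vals in groups.items()}

print(star_counts)  # same counts as F.count('*') in the table above
print(expr_counts)  # same counts as F.count('col2') in the question
```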

Regarding the second part: the expression F.col('col2').isNull() returns a boolean that is either true or false, never null. Since count only skips null values, every row is counted no matter what the actual value of the boolean is, and therefore you see a 2 for each group.

1 Comment

Thanks a lot, that explains a lot. For the second part, I should have used the expression in a filter and then counted.
