I was debugging a function and encountered a mysterious thing:
Given a PySpark DataFrame with one column (name_id), I build a second column (is_number) using a lambda function to check whether name_id is a string made up entirely of digits.
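For context, the column was built roughly like this (a minimal sketch from memory; the UDF name and the exact lambda body are illustrative, not my literal code):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Illustrative: flag rows whose name_id consists only of digit characters
is_number_udf = F.udf(lambda s: s is not None and s.isdigit(), BooleanType())
df = df.withColumn("is_number", is_number_udf(F.col("name_id")))
```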
The resulting dataframe (df) looks like this:
df.show(4, False)
| name_id | is_number |
|---|---|
| 0001 | true |
| 0002 | true |
| 0003 | true |
| 0004 | true |
I need to count the number of True values, so I do the following:
df.where(F.col("is_number")==True).count()
3
Three?? Really? What is happening here?
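For comparison, a toy DataFrame with a genuine boolean column behaves exactly as I'd expect (a minimal sketch built from scratch, not my real data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("0001", True), ("0002", True), ("0003", True), ("0004", True)],
    ["name_id", "is_number"],
)
toy.where(F.col("is_number") == True).count()  # 4, as expected
```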
It gets stranger:
df.groupBy("is_number").count().show(4, False)
| is_number | count |
|---|---|
| true | 4 |
So all four values appear to be the same True, BUT:
df.groupBy("is_number").count().where(F.col("is_number")=="True").collect()[0]["count"])
3
Again, it looks like applying the where function drops one of the True values; the filter function behaves the same way (sketched below).
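For completeness, this is the filter form I mean (where is just an alias of filter in PySpark, so identical behavior is expected):

```python
# filter() is an alias of where(), so this also returns 3 on my data
df.filter(F.col("is_number") == True).count()
```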
Additionally, I have identified which True value gets excluded: it is the first one.
df.where(F.col("is_number")==True).show(4, False)
| name_id | is_number |
|---|---|
| 0002 | true |
| 0003 | true |
| 0004 | true |
Other things I have tried (sketched in code below):
- Expressing True as "not False" doesn't work.
- The "true" values shown are real boolean True values, not the string "true".
- Using eqNullSafe() instead of == doesn't work either.
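Concretely, these are the variants I tried, sketched from memory (the exact expressions may have differed slightly; all of them still returned 3 for me):

```python
# "not False" instead of == True (roughly how I expressed it):
df.where(F.col("is_number") != False).count()

# Null-safe equality instead of ==:
df.where(F.col("is_number").eqNullSafe(True)).count()

# One way to confirm the values are real booleans, not the string "true"
# (a sketch; not necessarily how I originally checked):
df.select("is_number").dtypes  # expecting [('is_number', 'boolean')]
```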
Any ideas? This makes no sense to me!
Thank you in advance!
printSchema() of df?