
I was debugging a function and encountered a mysterious thing:

Given a PySpark DataFrame with one column (name_id), I build another column (is_number) using a lambda function that checks whether name_id is a string whose characters are all digits.
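For context, the check is equivalent to the following plain-Python sketch (the exact UDF is not shown here, so treat the details as an assumption):

```python
# Hypothetical version of the check behind is_number. The actual UDF is not
# shown in the question; this only illustrates the intent: a string whose
# characters are all digits, with a guard against nulls.
is_number = lambda s: s is not None and s.isdigit()

print(is_number("0001"))  # True: every character is a digit
print(is_number("00a1"))  # False: contains a non-digit
print(is_number(None))    # False: null-safe
```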

The resulting dataframe (df) looks like this:

df.show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0001   |true     |
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+

I need to count the number of True values, so I do the following:

df.where(F.col("is_number")==True).count()

3

Three?? Really? What is happening here?

It gets stranger:

df.groupBy("is_number").count().show(4, False)

+---------+-----+
|is_number|count|
+---------+-----+
|true     |4    |
+---------+-----+

It looks like all True values are the same, BUT:

df.groupBy("is_number").count().where(F.col("is_number")=="True").collect()[0]["count"]

3

Again, it looks like applying the where function eliminates one True value; the filter function behaves the same way.

Additionally, I have identified which True value is excluded: it is the first one.

df.where(F.col("is_number")==True).show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+

Other things I have tried: expressing True as not False doesn't work; the "true" values shown are representations of boolean True, not the string "true"; and using eqNullSafe() instead of == doesn't work either.

Any ideas? This is complete nonsense to me!

Thank you in advance!

  • Can you do printSchema() of df? Commented Oct 2, 2023 at 20:02
  • root |-- name_id: string (nullable = true) |-- is_number: boolean (nullable = true) Commented Oct 2, 2023 at 20:15
  • Update: It works if I do df=df.cache(). This is not optimal, since it might raise errors with large dataframes. Commented Oct 3, 2023 at 7:59
  • Update: It works with PySpark 3.1, but not with PySpark 2.4. Commented Oct 9, 2023 at 20:00
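One plausible explanation for the symptoms above (an assumption, not confirmed by the question): because Spark evaluates lazily, a Python UDF can be re-executed for each action, and where()/filter() triggers a second evaluation. If the UDF is not deterministic (e.g. it carries hidden state), the filter pass can disagree with what show() displayed, and df.cache() helps because it materializes the column once. A plain-Python analogy, with the non-determinism scripted so the effect is reproducible:

```python
# Plain-Python analogy (no Spark required) of a lazily recomputed column whose
# generating function is not deterministic: each "action" re-runs the function,
# so two actions can see different values for the same row.
scripted = iter([True, True, True, True,    # pass 1: what df.show() evaluates
                 False, True, True, True])  # pass 2: where(...) re-evaluates
flaky_udf = lambda s: next(scripted)        # stand-in for a non-deterministic UDF

rows = ["0001", "0002", "0003", "0004"]

shown = [flaky_udf(r) for r in rows]     # action 1: mimics df.show()
count = sum(flaky_udf(r) for r in rows)  # action 2: mimics where(...).count()
print(shown, count)  # [True, True, True, True] 3 -- same symptom as above

# cache() analogy: evaluate once, then reuse the materialized values everywhere.
scripted = iter([True, True, True, True])
cached = [flaky_udf(r) for r in rows]    # like df.cache() materializing the column
print(sum(cached))  # 4 -- show() and count() now agree
```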
