I was debugging a function and encountered a mysterious thing:
Given a PySpark DataFrame with one column (name_id), I build a second column (is_number) using a lambda function to check whether name_id is a string made up entirely of digits.
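For context, the column was built roughly like this (a minimal sketch from memory; the UDF name and the exact lambda body are illustrative, not my literal code):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Illustrative: flag rows whose name_id consists only of digit characters
is_number_udf = F.udf(lambda s: s is not None and s.isdigit(), BooleanType())
df = df.withColumn("is_number", is_number_udf(F.col("name_id")))
```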
The resulting dataframe (df) looks like this:
df.show(4, False)
| name_id | is_number |
|---|---|
| 0001 | true |
| 0002 | true |
| 0003 | true |
| 0004 | true |
I need to count the number of True values, so I do the following:
df.where(F.col("is_number")==True).count()
3
Three?? Really? What is happening here?
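For comparison, a toy DataFrame with a genuine boolean column behaves exactly as I'd expect (a minimal sketch built from scratch, not my real data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("0001", True), ("0002", True), ("0003", True), ("0004", True)],
    ["name_id", "is_number"],
)
toy.where(F.col("is_number") == True).count()  # 4, as expected
```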
It gets stranger:
df.groupBy("is_number").count().show(4, False)
| is_number | count |
|---|---|
| true | 4 |
So all four values appear to be the same True, BUT:
df.groupBy("is_number").count().where(F.col("is_number")=="True").collect()[0]["count"])
3
Again, it looks like applying the where function drops one of the True values; the filter function behaves the same way (sketched below).
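For completeness, this is the filter form I mean (where is just an alias of filter in PySpark, so identical behavior is expected):

```python
# filter() is an alias of where(), so this also returns 3 on my data
df.filter(F.col("is_number") == True).count()
```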
Additionally, I have identified which True value gets excluded: it is the first one.
df.where(F.col("is_number")==True).show(4, False)
| name_id | is_number |
|---|---|
| 0002 | true |
| 0003 | true |
| 0004 | true |
Other things I have tried (sketched in code below):
- Expressing True as "not False" doesn't work.
- The "true" values shown are real boolean True values, not the string "true".
- Using eqNullSafe() instead of == doesn't work either.
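Concretely, these are the variants I tried, sketched from memory (the exact expressions may have differed slightly; all of them still returned 3 for me):

```python
# "not False" instead of == True (roughly how I expressed it):
df.where(F.col("is_number") != False).count()

# Null-safe equality instead of ==:
df.where(F.col("is_number").eqNullSafe(True)).count()

# One way to confirm the values are real booleans, not the string "true"
# (a sketch; not necessarily how I originally checked):
df.select("is_number").dtypes  # expecting [('is_number', 'boolean')]
```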
Any ideas? This makes no sense to me!
Thank you in advance!
printSchema() of df?