
Hi, I have the following issue: I have a DataFrame numeric that I registered as a temporary table with numeric.registerTempTable("numeric").

All the values that I want to filter on are literal 'null' strings, not N/A or actual NULL values.

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

Unfortunately, numeric_filtered is always empty, even though I checked that numeric contains rows that should pass these conditions.

Here are some sample values:

Low   High   Normal
3.5   5.0    null
2.0   14.0   null
null  38.0   null
null  null   null
1.0   null   4.0

2 Answers

You are using logical conjunction (AND), which means all three columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you've tried follow exactly the same pattern (note that option 2 isn't even valid Python: column expressions are combined with &, not AND). What you need here is logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
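
Since keeping rows where at least one column differs from 'null' is just the negation of all three columns equalling 'null', De Morgan's laws give an equivalent form (a minimal sketch, reusing col imported above):

numeric_filtered = numeric.where(
    ~((col('LOW')    == 'null') &
      (col('NORMAL') == 'null') &
      (col('HIGH')   == 'null')))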

or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
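
As an aside, the string comparison only works because the placeholders here are literal 'null' strings. If the columns contained real NULLs, a comparison like low != 'null' would itself evaluate to NULL and the row would be dropped anyway, so you would filter with isNotNull instead; a minimal sketch:

real_nulls = sqlContext.createDataFrame(
    [(None, 38.0), (1.0, None)], ('low', 'high'))
real_nulls.where(col('low').isNotNull()).show()

## +---+----+
## |low|high|
## +---+----+
## |1.0|null|
## +---+----+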

See also: Pyspark: multiple conditions in when clause


1 Comment

Thanks! That helped a lot. The important thing is to enclose the individual conditions in parentheses: (col('foo') == 'bar').
from pyspark.sql.functions import countDistinct

# Count distinct non-NULL values in a column:
totalrecordcount = df.where("ColumnName is not null").select(countDistinct("ColumnName")).collect()[0][0]
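
For this question's data, where the placeholders are the literal string 'null' rather than real NULLs, the same idea would look like this (a sketch using the numeric DataFrame from above):

low_count = numeric.where("low != 'null'").select(countDistinct("low")).collect()[0][0]
## 3  (distinct non-'null' values in low: 3.5, 2.0, 1.0)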

1 Comment

Welcome to the community. Please consider editing your answer to explain how your code helps fix the issue at hand.
