After spending some time on a task and reviewing the Snowflake documentation, I noticed a potential improvement to one of my queries, both for readability and possibly for performance. The query uses a correlated EXISTS subquery to check whether there are any updates for the main table in a separate table of changes. Neither table has an explicit PK or any other constraints on allowed values. Here is a simplified example of the query:
    SELECT a.*
    FROM tableA a
    WHERE EXISTS (
        SELECT 1
        FROM tableA_CDC a_cdc
        WHERE a.column1 = a_cdc.column1
          AND a.column2 = a_cdc.column2
          AND (a.column3 = a_cdc.column3 OR (a.column3 IS NULL AND a_cdc.column3 IS NULL))
    )
I am interested in the last predicate, (a.column3 = a_cdc.column3 OR (a.column3 IS NULL AND a_cdc.column3 IS NULL)). column3 can be NULL, and when it is NULL on both sides we still want to fetch the row from the main table. column1 and column2 cannot contain NULL values, so no null handling is needed for them.
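To make the reason for the extra IS NULL branch concrete, here is a small sketch on literal rows. The names x, y, plain_eq and or_form are made up for illustration and stand in for a.column3 and a_cdc.column3; I am assuming the inline VALUES rows behave the same way as real table columns.

```sql
-- Literal rows standing in for (a.column3, a_cdc.column3); x and y are hypothetical names.
SELECT
    x,
    y,
    x = y                                  AS plain_eq,
    (x = y OR (x IS NULL AND y IS NULL))   AS or_form
FROM (VALUES (1, 1), (1, NULL), (NULL, NULL)) AS t (x, y);
-- plain_eq is NULL whenever either side is NULL, so a WHERE clause
-- using only "=" would drop the (NULL, NULL) row.
-- or_form is TRUE for (1, 1) and (NULL, NULL); for (1, NULL) it is NULL,
-- which WHERE treats as "no match".
```

This is just standard three-valued logic: a plain '=' never matches two NULLs, which is why the predicate needs the second branch.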
The problem is not only readability but, as I noticed, also performance. If the predicate is only the '=' comparison, or only the check that both columns are NULL, everything works fine according to the query profile, and the row counts returned by the two separate predicates add up to the correct total. But when the two conditions are combined with OR, the count of changes is still correct, yet the query profile shows that a full table scan was performed.
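For anyone who wants to compare the plans without running the queries, a rough sketch using EXPLAIN is below. It assumes the same tableA and tableA_CDC as in the question. I am also assuming that the partition counts reported by EXPLAIN reflect the same difference that shows up in the query profile, which may not hold if pruning only happens at run time.

```sql
-- Plan for the OR-based predicate.
EXPLAIN USING TEXT
SELECT a.*
FROM tableA a
WHERE EXISTS (
    SELECT 1
    FROM tableA_CDC a_cdc
    WHERE a.column1 = a_cdc.column1
      AND a.column2 = a_cdc.column2
      AND (a.column3 = a_cdc.column3 OR (a.column3 IS NULL AND a_cdc.column3 IS NULL))
);

-- Plan for the equality-only predicate, for comparison.
EXPLAIN USING TEXT
SELECT a.*
FROM tableA a
WHERE EXISTS (
    SELECT 1
    FROM tableA_CDC a_cdc
    WHERE a.column1 = a_cdc.column1
      AND a.column2 = a_cdc.column2
      AND a.column3 = a_cdc.column3
);
```

Comparing the join operators and the scanned partition counts in the two outputs should show where the plans diverge; the query profile after execution remains the more reliable view.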
In the documentation I found a function called EQUAL_NULL, which compares two expressions in a null-safe way. If I replace the grouped predicate with EQUAL_NULL, the result is correct and there is no full table scan:
    SELECT a.*
    FROM tableA a
    WHERE EXISTS (
        SELECT 1
        FROM tableA_CDC a_cdc
        WHERE a.column1 = a_cdc.column1
          AND a.column2 = a_cdc.column2
          AND EQUAL_NULL(a.column3, a_cdc.column3)
    )
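For completeness, Snowflake also accepts the standard IS NOT DISTINCT FROM operator, which has the same null-safe comparison semantics as EQUAL_NULL. The variant below is only a sketch: I have not checked whether it produces the same plan as the EQUAL_NULL version, so treat that part as an assumption.

```sql
-- Same query with the standard null-safe operator instead of EQUAL_NULL.
SELECT a.*
FROM tableA a
WHERE EXISTS (
    SELECT 1
    FROM tableA_CDC a_cdc
    WHERE a.column1 = a_cdc.column1
      AND a.column2 = a_cdc.column2
      AND a.column3 IS NOT DISTINCT FROM a_cdc.column3
);
```

This form is portable to other databases that support the operator.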
Any ideas why the first version of the query results in a full table scan?