94

Consider a SQL query with the following WHERE predicate:

...
WHERE name IS NOT NULL
...

Where name is a textual field in PostgreSQL.

No other query checks any textual property of this value, just whether it is NULL or not. Therefore, a full btree index seems like an overkill, even though it supports this distinction:

Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index.

What's the right PostgreSQL index to quickly distinguish NULLs from non-NULLs?

7
  • You can add you predicate to the create index to at least minimize its size. Commented Aug 12, 2015 at 13:12
  • As you don't have another choice than btree, gist, gin and hash, which is discouraged, I don't see another way possible. Commented Aug 12, 2015 at 13:17
  • you can create index i on t (coalesce('NULL',col)); to actually index NULL and avoid separating one nulls from other nulls Commented Aug 12, 2015 at 13:24
  • @VaoTsun o_O? NULL is indexed. Where do you get the idea that it isn't? Commented Aug 12, 2015 at 13:28
  • 1
    @VaoTsun The b-tree for NULL contains row pointers to each row with NULL in that field. While they unequal, they all match the predicate IS NULL. Your suggestion doesn't make sense, and not only that, Pg could only then actually use the index if the query also used the same expression. Commented Aug 12, 2015 at 13:44

3 Answers 3

88

I'm interpreting you claim that it's "overkill" in two ways: in terms of complexity (using a B-Tree instead of just a list) and space/performance.

For complexity, it's not overkill. A B-Tree index is preferable because deletes from it will be faster than some kind of "unordered" index (for lack of a better term). (An unordered index would require a full index scan just to delete.) In light of that fact, any gains from an unordered index would be usually be outweighed by the detriments, so the development effort isn't justified.

For space and performance, though, if you want a highly selective index for efficiency, you can include a WHERE clause on an index, as noted in the fine manual:

CREATE INDEX ON my_table (name) WHERE name IS NOT NULL;

Note that you'll only see benefits from this index if it can allow PostgreSQL to ignore a large amount of rows when executing your query. E.g., if 99% of the rows have name IS NOT NULL, the index isn't buying you anything over just letting a full table scan happen; in fact, it would be less efficient (as @CraigRinger notes) since it would require extra disk reads. If however, only 1% of rows have name IS NOT NULL, then this represents huge savings as PostgreSQL can ignore most of the table for your query. If your table is very large, even eliminating 50% of the rows might be worth it. This is a tuning problem, and whether the index is valuable is going to depend heavily on the size and distribution of the data.

Additionally, there is very little gain in terms of space if you still need another index for the name IS NULL rows. See Craig Ringer's answer for details.

Sign up to request clarification or add additional context in comments.

1 Comment

Well explained. In fact, using the index for something that'd match 99% of rows would be massively inefficient, much slower than seqscan.
41

You could use an expression index, but you shouldn't. Keep it simple, and use a plain b-tree.


An expression index can be created on colname IS NOT NULL:

test=> CREATE TABLE blah(name text);
CREATE TABLE
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL));
CREATE INDEX
test=> INSERT INTO blah(name) VALUES ('a'),('b'),(NULL);
INSERT 0 3
test=> SET enable_seqscan = off;
SET
craig=> SELECT * FROM blah WHERE name IS NOT NULL;
 name 
------
 a
 b
(2 rows)

test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Bitmap Heap Scan on blah  (cost=9.39..25.94 rows=1303 width=32)
   Filter: (name IS NOT NULL)
   ->  Bitmap Index Scan on name_notnull  (cost=0.00..9.06 rows=655 width=0)
         Index Cond: ((name IS NOT NULL) = true)
(4 rows)

test=> SET enable_bitmapscan = off;
SET
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Index Scan using name_notnull on blah  (cost=0.15..55.62 rows=1303 width=32)
   Index Cond: ((name IS NOT NULL) = true)
   Filter: (name IS NOT NULL)
(3 rows)

... but Pg doesn't realise that it's also usable for IS NULL:

test=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Seq Scan on blah  (cost=10000000000.00..10000000023.10 rows=7 width=32)
   Filter: (name IS NULL)
(2 rows)

and even transforms NOT (name IS NOT NULL) into name IS NULL, which is usually what you want.

test=> EXPLAIN SELECT * FROM blah WHERE NOT (name IS NOT NULL);
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Seq Scan on blah  (cost=10000000000.00..10000000023.10 rows=7 width=32)
   Filter: (name IS NULL)
(2 rows)

so you're actually better off with two disjoint expression indexes, one on the null and one on the non-null set.

test=> DROP INDEX name_notnull ;
DROP INDEX
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL)) WHERE (name IS NOT NULL);
CREATE INDEX
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Index Scan using name_notnull on blah  (cost=0.13..8.14 rows=3 width=32)
   Index Cond: ((name IS NOT NULL) = true)
(2 rows)

test=> CREATE INDEX name_null ON blah((name IS NULL)) WHERE (name IS NULL);
CREATE INDEX
craig=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Index Scan using name_null on blah  (cost=0.12..8.14 rows=1 width=32)
   Index Cond: ((name IS NULL) = true)
(2 rows)

This is pretty gruesome though. For most sensible uses I'd just use a plain b-tree index. The index size improvement isn't too exciting, at least for small-ish inputs, like the dummy I created with a bunch of md5 values:

test=> SELECT pg_size_pretty(pg_relation_size('blah'));
 pg_size_pretty 
----------------
 9416 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('blah_name'));
 pg_size_pretty 
----------------
 7984 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('name_notnull'));
 pg_size_pretty 
----------------
 2208 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('name_null'));
 pg_size_pretty 
----------------
 2208 kB
(1 row)

Comments

14

You can use an expression like (title IS NULL) as the indexed column. So this works as expected:

CREATE INDEX index_articles_on_title_null ON articles ( (title IS NULL) );
SELECT * FROM articles WHERE (title IS NULL)='t';

This has the big advantage over using a predicate that in this case the value stored in the index is only a yes/no boolean and not the full column value. So especially if your NULL-checked column tends to contain large values (like a title text field here), then this way of indexing is much more space-efficient than using a predicated index.

1 Comment

You don't need the = 't'. Using WHERE title IS NULL will work just fine

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.