PostgreSQL: Create an index to quickly distinguish NULL from non-NULL values

Question

Consider a SQL query with the following WHERE predicate:

...
WHERE name IS NOT NULL
...

Where name is a textual field in PostgreSQL.

No other query checks any textual property of this value, just whether it is NULL or not. Therefore, a full btree index seems like an overkill, even though it supports this distinction:

Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index.

What's the right PostgreSQL index to quickly distinguish NULLs from non-NULLs?

You can add you predicate to the create index to at least minimize its size. — Uwe Allner
– Uwe Allner, Commented Aug 12, 2015 at 13:12
As you don't have another choice than btree, gist, gin and hash, which is discouraged, I don't see another way possible. — Uwe Allner
– Uwe Allner, Commented Aug 12, 2015 at 13:17
you can create index i on t (coalesce('NULL',col)); to actually index NULL and avoid separating one nulls from other nulls — Vao Tsun
– Vao Tsun, Commented Aug 12, 2015 at 13:24
@VaoTsun o_O? NULL is indexed. Where do you get the idea that it isn't? — Craig Ringer
– Craig Ringer, Commented Aug 12, 2015 at 13:28
@VaoTsun The b-tree for NULL contains row pointers to each row with NULL in that field. While they unequal, they all match the predicate IS NULL. Your suggestion doesn't make sense, and not only that, Pg could only then actually use the index if the query also used the same expression. — Craig Ringer
– Craig Ringer, Commented Aug 12, 2015 at 13:44

Community · Accepted Answer · 2017-05-23 12:17:58Z

88

I'm interpreting you claim that it's "overkill" in two ways: in terms of complexity (using a B-Tree instead of just a list) and space/performance.

For complexity, it's not overkill. A B-Tree index is preferable because deletes from it will be faster than some kind of "unordered" index (for lack of a better term). (An unordered index would require a full index scan just to delete.) In light of that fact, any gains from an unordered index would be usually be outweighed by the detriments, so the development effort isn't justified.

For space and performance, though, if you want a highly selective index for efficiency, you can include a WHERE clause on an index, as noted in the fine manual:

CREATE INDEX ON my_table (name) WHERE name IS NOT NULL;

Note that you'll only see benefits from this index if it can allow PostgreSQL to ignore a large amount of rows when executing your query. E.g., if 99% of the rows have name IS NOT NULL, the index isn't buying you anything over just letting a full table scan happen; in fact, it would be less efficient (as @CraigRinger notes) since it would require extra disk reads. If however, only 1% of rows have name IS NOT NULL, then this represents huge savings as PostgreSQL can ignore most of the table for your query. If your table is very large, even eliminating 50% of the rows might be worth it. This is a tuning problem, and whether the index is valuable is going to depend heavily on the size and distribution of the data.

Additionally, there is very little gain in terms of space if you still need another index for the name IS NULL rows. See Craig Ringer's answer for details.

edited May 23, 2017 at 12:17

CommunityBot

11 silver badge

answered Aug 12, 2015 at 13:26

jpmc26

30.2k14 gold badges100 silver badges152 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Craig Ringer Over a year ago

Well explained. In fact, using the index for something that'd match 99% of rows would be massively inefficient, much slower than seqscan.

Craig Ringer · Accepted Answer · 2015-08-12 13:33:18Z

You could use an expression index, but you shouldn't. Keep it simple, and use a plain b-tree.

An expression index can be created on colname IS NOT NULL:

test=> CREATE TABLE blah(name text);
CREATE TABLE
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL));
CREATE INDEX
test=> INSERT INTO blah(name) VALUES ('a'),('b'),(NULL);
INSERT 0 3
test=> SET enable_seqscan = off;
SET
craig=> SELECT * FROM blah WHERE name IS NOT NULL;
 name 
------
 a
 b
(2 rows)

test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Bitmap Heap Scan on blah  (cost=9.39..25.94 rows=1303 width=32)
   Filter: (name IS NOT NULL)
   ->  Bitmap Index Scan on name_notnull  (cost=0.00..9.06 rows=655 width=0)
         Index Cond: ((name IS NOT NULL) = true)
(4 rows)

test=> SET enable_bitmapscan = off;
SET
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Index Scan using name_notnull on blah  (cost=0.15..55.62 rows=1303 width=32)
   Index Cond: ((name IS NOT NULL) = true)
   Filter: (name IS NOT NULL)
(3 rows)

... but Pg doesn't realise that it's also usable for IS NULL:

test=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Seq Scan on blah  (cost=10000000000.00..10000000023.10 rows=7 width=32)
   Filter: (name IS NULL)
(2 rows)

and even transforms NOT (name IS NOT NULL) into name IS NULL, which is usually what you want.

test=> EXPLAIN SELECT * FROM blah WHERE NOT (name IS NOT NULL);
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Seq Scan on blah  (cost=10000000000.00..10000000023.10 rows=7 width=32)
   Filter: (name IS NULL)
(2 rows)

so you're actually better off with two disjoint expression indexes, one on the null and one on the non-null set.

test=> DROP INDEX name_notnull ;
DROP INDEX
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL)) WHERE (name IS NOT NULL);
CREATE INDEX
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Index Scan using name_notnull on blah  (cost=0.13..8.14 rows=3 width=32)
   Index Cond: ((name IS NOT NULL) = true)
(2 rows)

test=> CREATE INDEX name_null ON blah((name IS NULL)) WHERE (name IS NULL);
CREATE INDEX
craig=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Index Scan using name_null on blah  (cost=0.12..8.14 rows=1 width=32)
   Index Cond: ((name IS NULL) = true)
(2 rows)

This is pretty gruesome though. For most sensible uses I'd just use a plain b-tree index. The index size improvement isn't too exciting, at least for small-ish inputs, like the dummy I created with a bunch of md5 values:

test=> SELECT pg_size_pretty(pg_relation_size('blah'));
 pg_size_pretty 
----------------
 9416 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('blah_name'));
 pg_size_pretty 
----------------
 7984 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('name_notnull'));
 pg_size_pretty 
----------------
 2208 kB
(1 row)

test=> SELECT pg_size_pretty(pg_relation_size('name_null'));
 pg_size_pretty 
----------------
 2208 kB
(1 row)

fxtentacle · Accepted Answer · 2017-09-21 14:29:09Z

14

You can use an expression like (title IS NULL) as the indexed column. So this works as expected:

CREATE INDEX index_articles_on_title_null ON articles ( (title IS NULL) );
SELECT * FROM articles WHERE (title IS NULL)='t';

This has the big advantage over using a predicate that in this case the value stored in the index is only a yes/no boolean and not the full column value. So especially if your NULL-checked column tends to contain large values (like a title text field here), then this way of indexing is much more space-efficient than using a predicated index.

answered Sep 21, 2017 at 14:29

fxtentacle

81311 silver badges15 bronze badges

1 Comment

user330315 Over a year ago

You don't need the = 't'. Using WHERE title IS NULL will work just fine

Collectives™ on Stack Overflow

PostgreSQL: Create an index to quickly distinguish NULL from non-NULL values

3 Answers 3

1 Comment

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related