
Using PG 9.5, I have a query that joins the rubber table's FK column to the fuzzy table's primary key column. Both columns have standard btree indexes. The rubber table has over 230MM rows; fuzzy has over 25MM. When I join these tables and apply a constraint on a column in fuzzy, PG keeps using a sequential scan for the join, and the query takes about 2 minutes.

SELECT * FROM rubber r 
JOIN fuzzy fp ON fp.id = r.fuzzy_id
WHERE fp.bean_num IN (73470871);

I've narrowed it down to the join's sequential scan being the slow part of the query. Namely, the following is very fast and uses the index:

SELECT * FROM rubber WHERE fuzzy_id = 12345

But when I try something like this, it's just as slow as the JOIN query above:

SELECT * FROM rubber WHERE fuzzy_id IN (
    SELECT id FROM fuzzy WHERE bean_num IN (73470871)
);

I suspect the query planner isn't able (or is deciding not) to use the index when matching on a set of foreign keys. The foreign key values are not unique, but they're not highly duplicated either, and none are null, so I couldn't take advantage of something like a partial index.
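One way to check whether this is a planner costing decision rather than a hard limitation is to temporarily discourage sequential scans for the session and re-run the query (a diagnostic only, not a production fix):

```sql
-- Diagnostic only: penalize seq scans so the planner reveals
-- whether an index-based join plan is even available.
SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM rubber r
JOIN fuzzy fp ON fp.id = r.fuzzy_id
WHERE fp.bean_num IN (73470871);

RESET enable_seqscan;
```

If the plan switches to an index (or nested-loop) plan and is fast, the problem is the planner's cost/statistics estimates rather than a missing index.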

table definitions:

-- 231MM rows
CREATE TABLE rubber (
    id bigint DEFAULT nextval('rubber_id_seq1'::regclass) PRIMARY KEY,
    context_id integer NOT NULL REFERENCES context(id) ON DELETE CASCADE,
    fuzzy_id integer REFERENCES fuzzy(id)
);

CREATE UNIQUE INDEX rubber_pkey1 ON rubber(id int8_ops);
CREATE INDEX rubber_context_id_idx1 ON rubber(context_id int4_ops);
CREATE INDEX rubber_fingerprint_id_idx1 ON rubber(fingerprint_id int4_ops);
CREATE INDEX rubber_conclusion_id_idx1 ON rubber(conclusion_id int4_ops);
CREATE UNIQUE INDEX rubber_id_idx ON rubber(id int8_ops);
CREATE INDEX rubber_fuzzy_id_idx1 ON rubber(fuzzy_id int4_ops);

-- 26.5MM rows
CREATE TABLE fuzzy (
    id SERIAL PRIMARY KEY,
    trip_id integer NOT NULL REFERENCES trip(id),
    device_id integer NOT NULL REFERENCES device(id),
    chirp_vision_id integer NOT NULL REFERENCES chirp_vision(id),
    mode_id integer NOT NULL REFERENCES mode(id),
    fig_id integer NOT NULL REFERENCES fig(id),
    gist_id integer NOT NULL REFERENCES gist(id),
    bean_num integer REFERENCES bean_num(id),
    key_path jsonb NOT NULL,
    CONSTRAINT fingerprint_tuple UNIQUE (chirp_vision_id, gist_id, key_path, trip_id, fig_id, device_id, mode_id)
);

CREATE UNIQUE INDEX fuzzy_pkey ON fuzzy(id int4_ops);
CREATE INDEX fuzzy_fig_id_idx ON fuzzy(fig_id int4_ops);
CREATE INDEX fuzzy_gist_id_idx ON fuzzy(gist_id int4_ops);
CREATE INDEX fuzzy_bean_num_idx ON fuzzy(bean_num int4_ops);
CREATE UNIQUE INDEX fingerprint_tuple ON fuzzy(chirp_vision_id int4_ops,gist_id int4_ops,key_path jsonb_ops,trip_id int4_ops,fig_id int4_ops,device_id int4_ops,mode_id int4_ops);

EXPLAIN (BUFFERS,ANALYZE):

"QUERY PLAN"
"Hash Join  (cost=5288.99..6339911.22 rows=15277 width=189) (actual time=82319.995..136625.784 rows=483 loops=1)"
"  Hash Cond: (r.fuzzy_id = fp.id)"
"  Buffers: shared hit=599 read=3151247"
"  ->  Seq Scan on rubber r  (cost=0.00..5466479.88 rows=231463888 width=80) (actual time=0.078..117561.885 rows=231463887 loops=1)"
"        Buffers: shared hit=597 read=3151244"
"  ->  Hash  (cost=5267.11..5267.11 rows=1750 width=109) (actual time=2.251..2.251 rows=23 loops=1)"
"        Buckets: 2048  Batches: 1  Memory Usage: 20kB"
"        Buffers: shared hit=2 read=3"
"        ->  Index Scan using fuzzy_bean_num_idx on fuzzy fp  (cost=0.44..5267.11 rows=1750 width=109) (actual time=2.220..2.244 rows=23 loops=1)"
"              Index Cond: (bean_num = 73470871)"
"              Buffers: shared hit=2 read=3"
"Planning time: 0.382 ms"
"Execution time: 136625.875 ms"

Is there a way to get better performance out of a query like this?

There is also an interesting comment on DBA Stack Exchange suggesting that an index on (fuzzy_id, bean_num) would help, but I don't understand how that would help.

UPDATE: I've migrated to PG 12.3 and this query runs in a couple hundred milliseconds now.

  • Please include the result of running EXPLAIN (BUFFERS, ANALYZE) on your query. Commented Feb 3, 2021 at 2:43
  • Show us the table and index definitions, as well as row counts for each of the tables. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. We need row counts because that can affect query planning. If you know how to do an EXPLAIN or get an execution plan, put the results in the question as well. Commented Feb 3, 2021 at 2:43
  • @AndyLester I've updated with the requested information Commented Feb 3, 2021 at 3:18
  • I'm honestly not sure why postgres is choosing that plan, but here are two small suggestions: try running ANALYZE rubber to update the statistics, that might do the trick. If not, one workaround would be performing a lateral join on rubber instead of a regular one, so that it would use the index for each row of fuzzy. Commented Feb 3, 2021 at 4:13
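The lateral-join workaround mentioned in the last comment could look like this (a sketch against the tables above; the correlated subquery forces an index lookup on rubber for each matching fuzzy row):

```sql
SELECT fp.*, r.*
FROM fuzzy fp
CROSS JOIN LATERAL (
    SELECT *
    FROM rubber
    WHERE fuzzy_id = fp.id   -- should use rubber_fuzzy_id_idx1 per row
) r
WHERE fp.bean_num IN (73470871);
```

This effectively hand-writes the nested-loop plan the planner declined to choose.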

1 Answer


Question: Why did you create 2 (almost) identical indexes on rubber.id:

CREATE UNIQUE INDEX rubber_pkey1 ON rubber(id int8_ops);
CREATE UNIQUE INDEX rubber_id_idx ON rubber(id int8_ops);

Advice: DROP INDEX rubber_id_idx;

An index that might be very useful for the JOIN, to give the planner better information about the relation between these tables, is this one:

CREATE INDEX fuzzy_bean_num_idx_2 ON fuzzy(bean_num, id);

You might also need a higher statistics target. Maybe for just one table, maybe both, maybe the entire system (default_statistics_target).

Edit: After changing the statistics settings, you have to run ANALYZE on these tables to refresh the statistics.
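Assuming the misestimate is on the join columns, raising the per-column statistics target could look like this (the value 1000 is illustrative, not a recommendation):

```sql
-- Illustrative: raise the statistics target on the join columns,
-- then re-gather statistics so the planner sees the new histograms.
ALTER TABLE rubber ALTER COLUMN fuzzy_id SET STATISTICS 1000;
ALTER TABLE fuzzy  ALTER COLUMN id       SET STATISTICS 1000;
ANALYZE rubber;
ANALYZE fuzzy;
```

Larger targets give the planner a finer-grained picture of value distribution, at the cost of slower ANALYZE and planning.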

Offtopic: Version 9.5 is old and will be EOL within the next few months. Newer versions behave differently and might also solve this performance problem.


2 Comments

"Version 9.5 is old and will be EOL within the next few months" - actually in 8 days
1. um.. not sure how that got there, thanks for catching that! 2. Yes, we have a new version (12) we're getting ready to migrate to. Sounds like a better use of time would be to migrate instead of trying to optimize
