I have a query that's performing pretty slowly. I believe the issue is that I'm joining across several large tables, but I still would have expected better performance. Query and EXPLAIN ANALYZE below:

SELECT
    "m_advertsnapshot"."id",
    "m_advertsnapshot"."created",
    "m_advertsnapshot"."modified",
    "m_advertsnapshot"."snapshot_timestamp",
    "m_advertsnapshot"."source_name",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NOT NULL and m_advert.colour_id IS NOT NULL and m_advert.ctype IS NOT NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_has_height_plate_ctype",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL and m_advert.ctype is NULL  WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate_c268",
    COUNT("m_adverthistory"."id") AS "adh_count",
    COUNT(CASE m_advert.widget_listing_id IS NULL and m_advert.height IS NULL and m_advert.colour_id is NULL WHEN True THEN 1 ELSE null END) AS "adh_count_with_no_wl_and_missing_height_and_missing_plate",
    COUNT("m_advert"."widget_listing_id") AS "adh_count_with_wl"
FROM "m_advertsnapshot"
    LEFT OUTER JOIN "m_adverthistory" ON ("m_advertsnapshot"."id" = "m_adverthistory"."advert_snapshot_id")
    LEFT OUTER JOIN "m_advert" ON ("m_adverthistory"."advert_id" = "m_advert"."id")
GROUP BY
    "m_advertsnapshot"."id",
    "m_advertsnapshot"."created",
    "m_advertsnapshot"."modified",
    "m_advertsnapshot"."snapshot_timestamp",
    "m_advertsnapshot"."source_name"
ORDER BY
    "m_advertsnapshot"."snapshot_timestamp" DESC



"Sort  (cost=796180.41..796180.90 rows=196 width=72) (actual time=18051.504..18051.519 rows=196 loops=1)"
"  Sort Key: m_advertsnapshot.snapshot_timestamp"
"  Sort Method: quicksort  Memory: 60kB"
"  ->  HashAggregate  (cost=796170.99..796172.95 rows=196 width=72) (actual time=18051.330..18051.396 rows=196 loops=1)"
"        ->  Hash Right Join  (cost=227052.68..622950.33 rows=6298933 width=72) (actual time=2082.551..12166.226 rows=6298933 loops=1)"
"              Hash Cond: (m_adverthistory.advert_snapshot_id = m_advertsnapshot.id)"
"              ->  Hash Left Join  (cost=227045.27..536332.59 rows=6298933 width=24) (actual time=2082.483..9971.996 rows=6298933 loops=1)"
"                    Hash Cond: (m_adverthistory.advert_id = m_advert.id)"
"                    ->  Seq Scan on m_adverthistory  (cost=0.00..121858.33 rows=6298933 width=12) (actual time=0.003..1644.060 rows=6298933 loops=1)"
"                    ->  Hash  (cost=202575.12..202575.12 rows=1332812 width=20) (actual time=2080.897..2080.897 rows=1332812 loops=1)"
"                          Buckets: 2048  Batches: 128  Memory Usage: 525kB"
"                          ->  Seq Scan on m_advert  (cost=0.00..202575.12 rows=1332812 width=20) (actual time=0.007..1564.220 rows=1332812 loops=1)"
"              ->  Hash  (cost=4.96..4.96 rows=196 width=52) (actual time=0.062..0.062 rows=196 loops=1)"
"                    Buckets: 1024  Batches: 1  Memory Usage: 17kB"
"                    ->  Seq Scan on m_advertsnapshot  (cost=0.00..4.96 rows=196 width=52) (actual time=0.004..0.030 rows=196 loops=1)"
"Total runtime: 18051.730 ms"

The query is taking 18 seconds on Postgres 9.2. The table sizes are:

m_advertsnapshot - 196 rows
m_adverthistory - 6,298,933 rows
m_advert - 1,332,812 rows

DDLs:

-- m_advertsnapshot

CREATE TABLE m_advertsnapshot
(
  id serial NOT NULL,
  snapshot_timestamp timestamp with time zone NOT NULL,
  source_name character varying(50),
  CONSTRAINT m_advertsnapshot_pkey PRIMARY KEY (id),
  CONSTRAINT m_advertsnapshot_source_name_6a9a437077520191_uniq UNIQUE (source_name, snapshot_timestamp)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_advertsnapshot_snapshot_timestamp
  ON m_advertsnapshot
  USING btree
  (snapshot_timestamp);

-- m_adverthistory

CREATE TABLE m_adverthistory
(
  id serial NOT NULL,
  advert_id integer NOT NULL,
  advert_snapshot_id integer NOT NULL,
  observed_timestamp timestamp with time zone NOT NULL,
  CONSTRAINT m_adverthistory_pkey PRIMARY KEY (id),
  CONSTRAINT advert_id_refs_id_30735d9eef85241c FOREIGN KEY (advert_id)
      REFERENCES m_advert (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT advert_snapshot_id_refs_id_55d3986f4f270624 FOREIGN KEY (advert_snapshot_id)
      REFERENCES m_advertsnapshot (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT m_adverthistory_advert_id_13fa0dae39e78983_uniq UNIQUE (advert_id, advert_snapshot_id)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_adverthistory_advert_id
  ON m_adverthistory
  USING btree
  (advert_id);

CREATE INDEX m_adverthistory_advert_snapshot_id
  ON m_adverthistory
  USING btree
  (advert_snapshot_id);

-- m_advert

CREATE TABLE m_advert
(
  id serial NOT NULL,
  widget_listing_id integer,
  height integer,
  ctype integer,
  colour_id integer,
  CONSTRAINT m_advert_pkey PRIMARY KEY (id),
  CONSTRAINT "colour_id_refs_id_1e4e2dac0183b419" FOREIGN KEY (colour_id)
      REFERENCES colour ("id") MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT widget_listing_id_refs_id_5a7e62d0d4f48013 FOREIGN KEY (widget_listing_id)
      REFERENCES m_widgetlisting (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED
)
WITH (
  OIDS=FALSE
);

CREATE INDEX m_advert_advert_seller_id
  ON m_advert
  USING btree
  (advert_seller_id);

CREATE INDEX m_advert_colour_id
  ON m_advert
  USING btree
  (colour_id);

CREATE INDEX m_advert_widget_listing_id
  ON m_advert
  USING btree
  (widget_listing_id);

Any ideas on how to improve the performance of this would be appreciated.

Thanks!

  • To me it appears that id (or advert_id) should at least be part of the primary key of m_advert / m_advertsnapshot. Do you have any primary keys or foreign keys in your schema? Please show us the DDLs. Commented Mar 27, 2013 at 17:24
  • I added the DDLs. The joins are on primary/foreign keys. These were generated by Django (though I don't think that makes a difference). Commented Mar 27, 2013 at 18:16
  • 2
    The schema looks reasonable (for the query you don't actually need the indexes, and some of the indexes are already covered by the FK constraints) The Junction table does not need a surrogate (but it won't harm). The real reason for your query being slow is that it needs all the rows from all the tables to compute the aggregates. If you need 100% of the data indexes cannot help very much. Adding an additional constraint (eg on snapshot_timestamp >= some_date) will probably cause a different plan that will use the indexes. Commented Mar 27, 2013 at 18:30
  • You might be able to get a boost by bumping work_mem up for this query, giving it more space to play with for its hashing and sorting. Try SET work_mem = '50MB' before the query and see if the plan or performance changes. Do not set this in postgresql.conf. Commented Mar 28, 2013 at 0:42
  • Have you tried changing the two indexes on the history table so that they have both fields, advert_id and advert_snapshot_id? Having two indexes with both fields, (advert_id, advert_snapshot_id) and (advert_snapshot_id, advert_id), might help, since the second key could be picked up from the index itself. Commented Mar 28, 2013 at 14:13
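The last two comments can be sketched as follows. The 50MB value and the index name are illustrative suggestions, not settings tested against this schema:

```sql
-- Give the hash join more memory for this session only: the plan shows
-- "Batches: 128", meaning the hash of m_advert spilled to disk.
SET work_mem = '50MB';

-- A composite index covering both join keys in reverse order, as suggested
-- above. The existing UNIQUE constraint on (advert_id, advert_snapshot_id)
-- already backs an index in that order, so only the reversed one is new.
CREATE INDEX m_adverthistory_snapshot_id_advert_id
  ON m_adverthistory (advert_snapshot_id, advert_id);
```

After creating the index, re-running EXPLAIN ANALYZE would show whether the planner actually picks it up; since this query reads every row anyway, it may well not.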

1 Answer

  • The schema looks reasonable (for this query you don't actually need the indexes, and some of them are already covered by the FK constraints).
  • The junction table does not need a surrogate key (but it won't harm).
  • The real reason your query is slow is that it needs all the rows from all the tables to compute the aggregates. If you need 100% of the data, indexes cannot help very much.
  • Adding an additional constraint (e.g. on snapshot_timestamp >= some_date) will probably produce a different plan that uses the indexes.
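As a sketch of that last point, with an abbreviated column list and a hypothetical cutoff date:

```sql
-- Restricting the snapshots first can let the planner use the index on
-- m_adverthistory.advert_snapshot_id instead of hashing all 6.3M rows.
-- '2013-01-01' is a placeholder cutoff, not a value from the question.
SELECT
    s."id",
    s."snapshot_timestamp",
    COUNT(h."id") AS "adh_count",
    COUNT(a."widget_listing_id") AS "adh_count_with_wl"
FROM "m_advertsnapshot" s
    LEFT OUTER JOIN "m_adverthistory" h ON s."id" = h."advert_snapshot_id"
    LEFT OUTER JOIN "m_advert" a ON h."advert_id" = a."id"
WHERE s."snapshot_timestamp" >= '2013-01-01'
GROUP BY s."id", s."snapshot_timestamp"
ORDER BY s."snapshot_timestamp" DESC;
```

Whether this helps depends on how selective the cutoff is; if the filter still matches most snapshots, the planner will rightly stick with sequential scans.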