
I have a many-to-many relationship which looks like this:

class Feed < ActiveRecord::Base
  has_many :cross_lists
  has_many :cross_listed_papers, through: :cross_lists, source: :paper
end
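
For reference, the join model and the Paper side look roughly like this (sketched from the query below; the real models have more going on):

class CrossList < ActiveRecord::Base
  belongs_to :feed
  belongs_to :paper
end

class Paper < ActiveRecord::Base
  has_many :cross_lists
  has_many :feeds, through: :cross_lists
end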

Frequently (i.e., on every page load), I will need to look up, sort, and paginate the cross-listed papers for a set of feeds over a given time period. However, this can be very slow:

[107] pry(main)> Paper.includes(:cross_lists).where(cross_lists: { feed_id: [311,312,313,314] }).where("pubdate >= ? and pubdate <= ?", Time.now - 300.days, Time.now).count

D, [2013-12-17T21:56:43.640283 #19404] DEBUG -- : (228.9ms) SELECT COUNT(DISTINCT "papers"."id") FROM "papers" LEFT OUTER JOIN "cross_lists" ON "cross_lists"."paper_id" = "papers"."id" WHERE "cross_lists"."feed_id" IN (311, 312, 313, 314) AND (pubdate >= '2013-02-20 10:56:43.404072' and pubdate <= '2013-12-17 10:56:43.404234') => 2811

228.9 ms is not ideal for something that may end up happening on every page load, particularly since the time inflates rapidly if I try to join in more data (even for shorter time ranges). Here's the EXPLAIN ANALYZE:

Aggregate  (cost=110771.10..110771.11 rows=1 width=4) (actual time=243.826..243.826 rows=1 loops=1)
  ->  Hash Join  (cost=95343.72..110749.09 rows=8807 width=4) (actual time=93.725..242.725 rows=2830 loops=1)
        Hash Cond: (cross_lists.paper_id = papers.id)
        ->  Bitmap Heap Scan on cross_lists  (cost=2876.53..15182.11 rows=158372 width=4) (actual time=15.496..90.232 rows=162981 loops=1)
              Recheck Cond: (feed_id = ANY ('{311,312,313,314}'::integer[]))
              ->  Bitmap Index Scan on index_cross_lists_on_feed_id_and_cross_list_date  (cost=0.00..2836.94 rows=158372 width=0) (actual time=14.383..14.383 rows=162981 loops=1)
                    Index Cond: (feed_id = ANY ('{311,312,313,314}'::integer[]))
        ->  Hash  (cost=91670.95..91670.95 rows=48499 width=4) (actual time=76.079..76.079 rows=48853 loops=1)
              Buckets: 4096  Batches: 2  Memory Usage: 861kB
              ->  Bitmap Heap Scan on papers  (cost=1033.46..91670.95 rows=48499 width=4) (actual time=6.495..61.230 rows=48853 loops=1)
                    Recheck Cond: ((pubdate >= '2013-02-20'::date) AND (pubdate <= '2013-12-17'::date))
                    ->  Bitmap Index Scan on index_papers_on_pubdate  (cost=0.00..1021.34 rows=48499 width=0) (actual time=5.437..5.437 rows=48855 loops=1)
                          Index Cond: ((pubdate >= '2013-02-20'::date) AND (pubdate <= '2013-12-17'::date))
Total runtime: 244.295 ms

Are there indexes I can use to speed up this kind of query, or do I need to resort to denormalization?

  • The query you posted and EXPLAIN ANALYZEd doesn't seem to correspond to your Ruby code, or the Ruby code might not be doing what you're hoping: there's no sort operator in the plan. That's valid since you're counting, but it hints at an error in your logic. Commented Dec 17, 2013 at 11:17
  • Good point. Question is updated. Commented Dec 17, 2013 at 11:26

2 Answers


The query plan seems sane. As you can see from the choices made by the planner, you're already using your indexes as well as you can, i.e. for bitmap index scans.

What's taking the time is joining 49k rows with 163k rows. There's little you can do about it, since neither of your criteria can be pre-aggregated.

Question the rationale for running the query this frequently. I'm guessing it's run to calculate the total number of pages? If so, couldn't you cache that count until it changes? (If not, perhaps post more information and explain what the query is trying to achieve in plain English.)
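
If the count is only needed for pagination, one option is to cache it and recompute lazily. A minimal sketch using Rails.cache (the key layout and expiry are assumptions; busting the cache whenever cross_lists change for these feeds would keep it exact):

def cross_listed_count(feed_ids, from, to)
  # Key the cache on the inputs so different feed sets and windows
  # don't collide; expire periodically as a safety net.
  key = ["cross_listed_count", feed_ids.sort, from.to_date, to.to_date]
  Rails.cache.fetch(key, expires_in: 1.hour) do
    Paper.includes(:cross_lists)
         .where(cross_lists: { feed_id: feed_ids })
         .where("pubdate >= ? and pubdate <= ?", from, to)
         .count
  end
end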


1 Comment

I was afraid of that. Looks like aggressive caching is the way to go. Will accept the answer if nobody comes up with anything else.

I did end up deciding to denormalize very slightly. By moving pubdate onto the cross_lists table (as cross_list_date) instead, I get a ~10x speedup:

[185] pry(main)> Paper.joins(:cross_lists).where("cross_lists.feed_id IN (?) AND cross_lists.cross_list_date >= ? AND cross_lists.cross_list_date <= ?", feed_ids, Time.now - 300.days, Time.now).count

(22.4ms) SELECT COUNT(*) FROM "papers" INNER JOIN "cross_lists" ON "cross_lists"."paper_id" = "papers"."id" WHERE (cross_lists.feed_id IN (311,312,313,314) AND cross_lists.cross_list_date >= '2013-02-20 14:33:29.034243' AND cross_lists.cross_list_date <= '2013-12-17 14:33:29.034443') => 2830
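
This is fast because both predicates are now covered by the composite index already visible in the plan above (index_cross_lists_on_feed_id_and_cross_list_date). For anyone setting this up from scratch, the denormalization is roughly the following (a sketch; the backfill-from-pubdate step is an assumption about how cross_list_date gets populated):

class AddCrossListDateToCrossLists < ActiveRecord::Migration
  def up
    add_column :cross_lists, :cross_list_date, :date

    # Backfill from the associated paper's pubdate.
    execute <<-SQL
      UPDATE cross_lists
      SET cross_list_date = papers.pubdate
      FROM papers
      WHERE papers.id = cross_lists.paper_id
    SQL

    # Composite index so the query can filter on feed and date at once.
    add_index :cross_lists, [:feed_id, :cross_list_date]
  end
end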

I can then paginate on the result of this query and join in the other data post-limit, which vastly reduces the impact of large result sets.
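
The paginate-then-hydrate step looks roughly like this (a sketch: feed_ids, page, and per_page come from the controller, and :authors stands in for whatever associations the page actually renders):

# Page through the cheap denormalized query first. Note: if a paper can
# be cross-listed to more than one of the selected feeds, the ids may
# contain duplicates and need deduplicating.
paper_ids = Paper.joins(:cross_lists)
                 .where("cross_lists.feed_id IN (?) AND cross_lists.cross_list_date >= ? AND cross_lists.cross_list_date <= ?",
                        feed_ids, Time.now - 300.days, Time.now)
                 .order("papers.pubdate DESC")
                 .limit(per_page)
                 .offset((page - 1) * per_page)
                 .pluck("papers.id")

# ...then load the heavier associations only for the page at hand.
papers = Paper.includes(:authors).where(id: paper_ids).order("pubdate DESC")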

