
I have a many-to-many relationship which looks like this:

class Feed < ActiveRecord::Base
  has_many :cross_lists
  has_many :cross_listed_papers, through: :cross_lists, source: :paper
end
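
For reference, the join model and the Paper side look roughly like this (sketched from the query below; the real models have more going on):

class CrossList < ActiveRecord::Base
  belongs_to :feed
  belongs_to :paper
end

class Paper < ActiveRecord::Base
  has_many :cross_lists
  has_many :feeds, through: :cross_lists
end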

Frequently (i.e., on every page load), I will need to look up, sort, and paginate the cross-listed papers for a set of feeds over a given time period. However, this can be very slow:

[107] pry(main)> Paper.includes(:cross_lists).where(cross_lists: { feed_id: [311,312,313,314] }).where("pubdate >= ? and pubdate <= ?", Time.now - 300.days, Time.now).count

D, [2013-12-17T21:56:43.640283 #19404] DEBUG -- : (228.9ms) SELECT COUNT(DISTINCT "papers"."id") FROM "papers" LEFT OUTER JOIN "cross_lists" ON "cross_lists"."paper_id" = "papers"."id" WHERE "cross_lists"."feed_id" IN (311, 312, 313, 314) AND (pubdate >= '2013-02-20 10:56:43.404072' and pubdate <= '2013-12-17 10:56:43.404234') => 2811

228.9 ms is not ideal for something that may end up happening on every page load, particularly since the time inflates rapidly if I try to join in more data (even for shorter time ranges). Here's the EXPLAIN ANALYZE:

Aggregate  (cost=110771.10..110771.11 rows=1 width=4) (actual time=243.826..243.826 rows=1 loops=1)
  ->  Hash Join  (cost=95343.72..110749.09 rows=8807 width=4) (actual time=93.725..242.725 rows=2830 loops=1)
        Hash Cond: (cross_lists.paper_id = papers.id)
        ->  Bitmap Heap Scan on cross_lists  (cost=2876.53..15182.11 rows=158372 width=4) (actual time=15.496..90.232 rows=162981 loops=1)
              Recheck Cond: (feed_id = ANY ('{311,312,313,314}'::integer[]))
              ->  Bitmap Index Scan on index_cross_lists_on_feed_id_and_cross_list_date  (cost=0.00..2836.94 rows=158372 width=0) (actual time=14.383..14.383 rows=162981 loops=1)
                    Index Cond: (feed_id = ANY ('{311,312,313,314}'::integer[]))
        ->  Hash  (cost=91670.95..91670.95 rows=48499 width=4) (actual time=76.079..76.079 rows=48853 loops=1)
              Buckets: 4096  Batches: 2  Memory Usage: 861kB
              ->  Bitmap Heap Scan on papers  (cost=1033.46..91670.95 rows=48499 width=4) (actual time=6.495..61.230 rows=48853 loops=1)
                    Recheck Cond: ((pubdate >= '2013-02-20'::date) AND (pubdate <= '2013-12-17'::date))
                    ->  Bitmap Index Scan on index_papers_on_pubdate  (cost=0.00..1021.34 rows=48499 width=0) (actual time=5.437..5.437 rows=48855 loops=1)
                          Index Cond: ((pubdate >= '2013-02-20'::date) AND (pubdate <= '2013-12-17'::date))
Total runtime: 244.295 ms

Are there indexes I can use to speed up this kind of query, or do I need to resort to denormalization?

  • The query you posted and EXPLAIN ANALYZEd doesn't seem to correspond to your Ruby code, or the Ruby code might not be doing what you're hoping: there's no sort operator in the plan. That's valid since you're counting, but it hints at an error in your logic. Commented Dec 17, 2013 at 11:17
  • Good point. Question is updated. Commented Dec 17, 2013 at 11:26

2 Answers


The query plan seems sane. As you can see from the choices made by the planner, you're already using your indexes as well as you can, i.e. for bitmap index scans.

What's taking the time is joining 49k rows with 163k rows. There's little you can do about it, since neither of your criteria can be pre-aggregated.

Question the rationale for running the query this frequently. I'm guessing it's run to calculate the total number of pages? If so, couldn't you cache that count until it changes? (If not, perhaps post more information and explain what the query is trying to achieve in plain English.)
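
If the count is only needed for pagination, one option is to cache it and recompute lazily. A minimal sketch using Rails.cache (the key layout and expiry are assumptions; busting the cache whenever cross_lists change for these feeds would keep it exact):

def cross_listed_count(feed_ids, from, to)
  # Key the cache on the inputs so different feed sets and windows
  # don't collide; expire periodically as a safety net.
  key = ["cross_listed_count", feed_ids.sort, from.to_date, to.to_date]
  Rails.cache.fetch(key, expires_in: 1.hour) do
    Paper.includes(:cross_lists)
         .where(cross_lists: { feed_id: feed_ids })
         .where("pubdate >= ? and pubdate <= ?", from, to)
         .count
  end
end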


1 Comment

I was afraid of that. Looks like aggressive caching is the way to go. Will accept the answer if nobody comes up with anything else.

I did end up deciding to denormalize very slightly. By moving pubdate onto the cross_lists table (as cross_list_date) instead, I get a ~10x speedup:

[185] pry(main)> Paper.joins(:cross_lists).where("cross_lists.feed_id IN (?) AND cross_lists.cross_list_date >= ? AND cross_lists.cross_list_date <= ?", feed_ids, Time.now - 300.days, Time.now).count

(22.4ms) SELECT COUNT(*) FROM "papers" INNER JOIN "cross_lists" ON "cross_lists"."paper_id" = "papers"."id" WHERE (cross_lists.feed_id IN (311,312,313,314) AND cross_lists.cross_list_date >= '2013-02-20 14:33:29.034243' AND cross_lists.cross_list_date <= '2013-12-17 14:33:29.034443') => 2830
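
This is fast because both predicates are now covered by the composite index already visible in the plan above (index_cross_lists_on_feed_id_and_cross_list_date). For anyone setting this up from scratch, the denormalization is roughly the following (a sketch; the backfill-from-pubdate step is an assumption about how cross_list_date gets populated):

class AddCrossListDateToCrossLists < ActiveRecord::Migration
  def up
    add_column :cross_lists, :cross_list_date, :date

    # Backfill from the associated paper's pubdate.
    execute <<-SQL
      UPDATE cross_lists
      SET cross_list_date = papers.pubdate
      FROM papers
      WHERE papers.id = cross_lists.paper_id
    SQL

    # Composite index so the query can filter on feed and date at once.
    add_index :cross_lists, [:feed_id, :cross_list_date]
  end
end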

I can then paginate on the result of this query and join in the other data post-limit, which vastly reduces the impact of large result sets.
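
The paginate-then-hydrate step looks roughly like this (a sketch: feed_ids, page, and per_page come from the controller, and :authors stands in for whatever associations the page actually renders):

# Page through the cheap denormalized query first. Note: if a paper can
# be cross-listed to more than one of the selected feeds, the ids may
# contain duplicates and need deduplicating.
paper_ids = Paper.joins(:cross_lists)
                 .where("cross_lists.feed_id IN (?) AND cross_lists.cross_list_date >= ? AND cross_lists.cross_list_date <= ?",
                        feed_ids, Time.now - 300.days, Time.now)
                 .order("papers.pubdate DESC")
                 .limit(per_page)
                 .offset((page - 1) * per_page)
                 .pluck("papers.id")

# ...then load the heavier associations only for the page at hand.
papers = Paper.includes(:authors).where(id: paper_ids).order("pubdate DESC")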

