PostgreSQL distinct on + order by query optimization

Question

I'm having a small issue here with a query.

SELECT DISTINCT ON ("reporting_processedamazonsnapshot"."offer_id") *
FROM "reporting_processedamazonsnapshot" INNER JOIN 
     "offers_boooffer"
        ON ("reporting_processedamazonsnapshot"."offer_id" =
            "offers_boooffer"."id") INNER JOIN
     "offers_offersettings"
        ON ("offers_boooffer"."id" = "offers_offersettings"."offer_id")
WHERE "offers_offersettings"."account_id" = 20
ORDER BY "reporting_processedamazonsnapshot"."offer_id" ASC,
         "reporting_processedamazonsnapshot"."scraping_date" DESC

I have an index called latest_scraping on (offer_id ASC, scraping_date DESC), but for some reason PostgreSQL still performs a sort after using the index, which causes a huge performance issue.
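
For reference, an index matching that description would presumably have been created with something like the statement below. This is a sketch reconstructed from the description only, not the actual DDL, which was generated elsewhere and may carry different options.

```sql
-- Assumed definition, inferred from the description of latest_scraping.
CREATE INDEX latest_scraping
    ON reporting_processedamazonsnapshot (offer_id ASC, scraping_date DESC);
```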

I don't understand why it's not using the already sorted data instead of redoing a sort. Is my index wrong? Or should I try to do my query another way?

Here's the EXPLAIN (ANALYZE) output:

Unique  (cost=21260.47..21263.06 rows=519 width=1288) (actual time=38053.685..38177.348 rows=1783 loops=1)
  ->  Sort  (cost=21260.47..21261.76 rows=519 width=1288) (actual time=38053.683..38161.478 rows=153095 loops=1)
        Sort Key: reporting_processedamazonsnapshot.offer_id, reporting_processedamazonsnapshot.scraping_date DESC
        Sort Method: external merge  Disk: 162088kB
        ->  Nested Loop  (cost=41.90..21237.06 rows=519 width=1288) (actual time=70.874..36148.348 rows=153095 loops=1)
              ->  Nested Loop  (cost=41.47..17547.90 rows=1627 width=8) (actual time=54.287..126.740 rows=1784 loops=1)
                    ->  Bitmap Heap Scan on offers_offersettings  (cost=41.04..4823.48 rows=1627 width=4) (actual time=52.532..84.102 rows=1784 loops=1)
                          Recheck Cond: (account_id = 20)
                          Heap Blocks: exact=38
                          ->  Bitmap Index Scan on offers_offersettings_account_id_fff7a8c0  (cost=0.00..40.63 rows=1627 width=0) (actual time=49.886..49.886 rows=4132 loops=1)
                                Index Cond: (account_id = 20)
                    ->  Index Only Scan using offers_boooffer_pkey on offers_boooffer  (cost=0.43..7.81 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1784)
                          Index Cond: (id = offers_offersettings.offer_id)
                          Heap Fetches: 1784
              ->  Index Scan using latest_scraping on reporting_processedamazonsnapshot  (cost=0.43..1.69 rows=58 width=1288) (actual time=0.526..20.146 rows=86 loops=1784)
                    Index Cond: (offer_id = offers_boooffer.id)
Planning time: 187.133 ms
Execution time: 38195.266 ms

Have you ever heard of table aliases? Your query is quite difficult to read.
– Gordon Linoff, Commented Jul 31, 2018 at 23:12
@GordonLinoff No, I'm actually not familiar with SQL. I understand queries and can write some, but I generally try to avoid writing them. I use Django with its ORM to interact with my DB. The query above comes from Django, and I simplified it a bit to make it easier to follow.
– PhilipGarnero, Commented Aug 1, 2018 at 9:02

Laurenz Albe · Accepted Answer · 2018-08-01 12:47:16Z

To use the index to avoid the sort, PostgreSQL would first have to scan all of "reporting_processedamazonsnapshot" in index order, then join all of "offers_boooffer" using a nested loop join (so that the order is preserved) and then join all of "offers_offersettings", again using a nested loop join.

Finally, all rows that don't match the condition "offers_offersettings"."account_id" = 20 would be thrown away.

PostgreSQL believes – correctly in my opinion – that it is more efficient to start by reducing the number of rows as much as possible using the condition, then use the most efficient join method to join the tables and then sort for the DISTINCT clause.

I wonder if the following query might be faster:

SELECT DISTINCT ON (q.offer_id) *
FROM offers_offersettings ofs
   JOIN offers_boooffer bo ON bo.id = ofs.offer_id
   CROSS JOIN LATERAL
      (SELECT *
       FROM reporting_processedamazonsnapshot r
       WHERE r.offer_id = bo.id
       ORDER BY r.scraping_date DESC
       LIMIT 1) q
WHERE ofs.account_id = 20
ORDER BY q.offer_id ASC, q.scraping_date DESC;

The execution plan would be similar, except that fewer rows would have to be scanned from the index, which should reduce execution time where you need it most.

If you want to speed up the sort, increase work_mem to around 500MB for that query (if you can afford the memory). The plan shows the sort spilling 162088kB to disk with an external merge.
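
A minimal sketch of raising work_mem for this one query only, rather than for the whole server: SET LOCAL lasts until the end of the current transaction. The 500MB value is simply the figure suggested above, not a measured requirement, and the query shown is the original one rewritten with aliases.

```sql
BEGIN;

-- Applies only inside this transaction; reverts at COMMIT or ROLLBACK.
SET LOCAL work_mem = '500MB';

-- EXPLAIN (ANALYZE, BUFFERS) runs the query and shows whether the sort
-- still spills to disk ("Sort Method: external merge") or fits in memory.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (r.offer_id) *
FROM offers_offersettings ofs
   JOIN offers_boooffer bo ON bo.id = ofs.offer_id
   JOIN reporting_processedamazonsnapshot r ON r.offer_id = bo.id
WHERE ofs.account_id = 20
ORDER BY r.offer_id ASC, r.scraping_date DESC;

COMMIT;
```

If the Sort node no longer reports an external merge, the setting was large enough for the sort to stay in memory.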

edited Aug 1, 2018 at 12:47

answered Aug 1, 2018 at 9:02

Laurenz Albe

5 Comments

PhilipGarnero Over a year ago

I did ANALYZE the tables and EXPLAIN returns the same results. What would you do to prevent the sort?

Laurenz Albe Over a year ago

The only way to prevent a sort and have a reasonable execution time is to omit the DISTINCT and the ORDER BY. But how can sorting 692 rows take a lot of time? How do you know that the time is spent sorting? Did you look at EXPLAIN (ANALYZE) output?

PhilipGarnero Over a year ago

I edited my question to add the EXPLAIN (ANALYZE) output

Laurenz Albe Over a year ago

The sort only takes 2 seconds. 36 out of your 38 seconds are spent in the index scan on reporting_processedamazonsnapshot.

Laurenz Albe Over a year ago

I have come up with a suggestion for improvement.
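
The breakdown given in the comments can be checked against the EXPLAIN (ANALYZE) figures in the question. The index scan on reporting_processedamazonsnapshot reports about 20.146 ms per loop over 1784 loops, and the Sort node's finish time minus the outer Nested Loop's finish time approximates the time spent sorting. A quick sketch of that arithmetic, using only numbers copied from the plan:

```sql
SELECT 1784 * 20.146         AS index_scan_ms,  -- per-loop time x number of loops
       38161.478 - 36148.348 AS sort_ms;        -- Sort end minus Nested Loop end
```

That comes to roughly 35.9 seconds in the index scan and about 2 seconds in the sort, which matches the "36 out of your 38 seconds" estimate.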
