Optimizing a large "distinct" select in Postgres

Question

I have a rather large dataset (millions of rows). I'm having trouble introducing a "distinct" concept to a certain query. (I'm putting distinct in quotes because this could be provided by the Postgres keyword DISTINCT or by a GROUP BY form.)

A non-distinct search takes 1ms to 2ms; every attempt to introduce a "distinct" concept has grown this to the 50,000ms to 90,000ms range.

My goal is to show the latest resources based on their most recent appearance in an event stream.

My non-distinct query is essentially this:

SELECT
    resource.id AS resource_id,
    stream_event.event_timestamp AS event_timestamp
FROM
    resource
    JOIN
        resource_2_stream_event ON (resource.id = resource_2_stream_event.resource_id)
    JOIN
        stream_event ON (resource_2_stream_event.stream_event_id = stream_event.id)
WHERE
    stream_event.viewer = 47
ORDER BY event_timestamp DESC
LIMIT 25
;

I've tried many different forms of queries (and subqueries) using DISTINCT, GROUP BY and MAX(event_timestamp). The issue isn't getting a query that works, it's getting one that works in a reasonable execution time. Looking at the EXPLAIN ANALYZE output for each one, everything is running off of indexes. The problem seems to be that with any attempt to deduplicate my results, Postgres must assemble the entire result set on disk; since each table has millions of rows, this becomes a bottleneck.

--

Update

Here's a working GROUP BY query:

EXPLAIN ANALYZE 
SELECT
    resource.id AS resource_id,
    max(stream_event.event_timestamp) AS stream_event_event_timestamp
FROM 
    resource 
    JOIN resource_2_stream_event ON (resource_2_stream_event.resource_id = resource.id) 
    JOIN stream_event ON stream_event.id = resource_2_stream_event.stream_event_id
WHERE (
        (stream_event.viewer_id = 57) AND 
        (resource.condition_1 IS NOT True) AND 
        (resource.condition_2 IS NOT True) AND 
        (resource.condition_3 IS NOT True) AND 
        (resource.condition_4 IS NOT True) AND 
        ( 
            (resource.condition_5 IS NULL) OR (resource.condition_6 IS NULL) 
        )
    )
GROUP BY (resource.id)
ORDER BY stream_event_event_timestamp DESC LIMIT 25;

Looking at the query plan (via EXPLAIN ANALYZE), it seems that adding the max + GROUP BY clause (or a DISTINCT) forces a sequential scan, which takes about half of the total time to compute. There is already an index that contains every "condition", and I also tried creating a set of indexes (one for each element). None of them help.

In any event, the difference is between 2ms and 72,000ms.

Can you add a full version of a working query that gets you the desired result? Also, if several variations give the same result, show them. And how critical are the WHERE, ORDER BY and LIMIT 25 clauses?
– Bulat, Commented Sep 23, 2014 at 22:33
How about posting said EXPLAIN ANALYZE? And how does the full code look with the distinct query you mention below?
– Jakub Kania, Commented Sep 24, 2014 at 0:33
0) intention of your query 1) table definitions, including indexes 2) resulting query plan 3) relevant configuration settings.
– wildplasser, Commented Sep 24, 2014 at 10:42
Perhaps you can use a CTE to materialize the "fast" version of the query and then do other operations afterwards.
– Gordon Linoff, Commented Sep 24, 2014 at 11:22

Gordon Linoff · Accepted Answer · 2014-09-24 20:53:33Z

2

Often, DISTINCT ON is the most efficient way to get one row per group (here, one row per resource). I would suggest trying:

SELECT DISTINCT ON (r.id) r.id AS resource_id, se.event_timestamp
FROM resource r JOIN
     resource_2_stream_event r2se
     ON r.id = r2se.resource_id JOIN
     stream_event se
     ON r2se.stream_event_id = se.id
WHERE se.viewer = 47
ORDER BY r.id, se.event_timestamp DESC
LIMIT 25;

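An index might help performance. Note that resource(id, event_timestamp) cannot be indexed as written, because event_timestamp belongs to stream_event in the queries above; the index would need to be on stream_event.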
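
A sketch of indexes that could serve the viewer filter and the timestamp ordering, assuming the column names from the question's first query (the updated query calls the column viewer_id). The index names are hypothetical, and whether the planner actually uses them depends on the data:

```sql
-- Hypothetical index names; columns follow the question's first query.
-- Lets Postgres read one viewer's events already sorted newest-first.
CREATE INDEX stream_event_viewer_ts_idx
    ON stream_event (viewer, event_timestamp DESC);

-- Supports the join from stream_event rows to the link table.
CREATE INDEX r2se_stream_event_id_idx
    ON resource_2_stream_event (stream_event_id, resource_id);
```

Comparing EXPLAIN ANALYZE output before and after creating them is the only reliable way to tell whether they help.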

EDIT:

You might try using a CTE to get what you want:

WITH CTE as (
      SELECT r.id AS resource_id,
             se.event_timestamp AS stream_event_event_timestamp
      FROM resource r JOIN
           resource_2_stream_event r2se
           ON r2se.resource_id = r.id JOIN
           stream_event se
           ON se.id = r2se.stream_event_id
      WHERE ((se.viewer_id = 57) AND 
             (r.condition_1 IS NOT True) AND 
             (r.condition_2 IS NOT True) AND 
             (r.condition_3 IS NOT True) AND 
             (r.condition_4 IS NOT True) AND 
             ( (r.condition_5 IS NULL) OR (r.condition_6 IS NULL) 
             )
            )
    )
SELECT resource_id, max(stream_event_event_timestamp) as stream_event_event_timestamp
FROM CTE
GROUP BY resource_id
ORDER BY stream_event_event_timestamp DESC
LIMIT 25;

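Postgres materializes the CTE (always in versions before 12; from version 12 on, a side-effect-free CTE that is referenced only once is inlined unless it is declared AS MATERIALIZED). So, if there are not that many matches, this may speed up the query by using indexes for the CTE.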

edited Sep 24, 2014 at 20:53

answered Sep 23, 2014 at 22:42


6 Comments

Jonathan Vanasco Over a year ago

I need to order by the most recent event_timestamp, so I need a nested subquery when using DISTINCT ON like this. The select I need is EXPLAIN ANALYZE SELECT sq.* FROM ( DISTINCT_QUERY ) sq ORDER BY sq.event_timestamp DESC LIMIT 25;, where DISTINCT_QUERY is your query. It runs slightly faster than the other select (between 1% and 4%), but still takes too long.
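
Spelled out, the wrapped form described in this comment looks roughly like the following. It assumes DISTINCT_QUERY is the answer's query without its LIMIT; the comment does not say whether the inner LIMIT was kept, but leaving it in would cut the set down to the 25 lowest ids before the outer sort:

```sql
SELECT sq.*
FROM (
    -- One row per resource: its most recent event for this viewer.
    SELECT DISTINCT ON (r.id)
           r.id AS resource_id,
           se.event_timestamp
    FROM resource r
    JOIN resource_2_stream_event r2se ON r.id = r2se.resource_id
    JOIN stream_event se ON r2se.stream_event_id = se.id
    WHERE se.viewer = 47
    ORDER BY r.id, se.event_timestamp DESC
) sq
-- Re-sort the deduplicated rows so the newest resources come first.
ORDER BY sq.event_timestamp DESC
LIMIT 25;
```

The inner ORDER BY has to start with r.id because DISTINCT ON requires its expressions to lead the sort, which is why the ordering by timestamp alone is done in the outer query.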

Andrew Lazarus Over a year ago

Does it work to remove r.id from the ORDER clause without demoting to a subquery?

Jonathan Vanasco Over a year ago

Thanks for the help! This is definitely getting there: down to 20s. (FYI, the max needs to be removed from the subquery.) Going to play around with the CTE some more.

Gordon Linoff Over a year ago

The next idea would be to add something like order by stream_event_timestamp desc limit 1000 to the CTE and hope that you have at least 25 resource ids in that list.
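
A sketch of that idea applied to the CTE from the answer. The value 1000 is the guess from the comment, not a tuned number, and the output alias latest_event_timestamp is introduced here only to keep the outer ORDER BY unambiguous:

```sql
WITH cte AS (
    SELECT r.id AS resource_id,
           se.event_timestamp AS stream_event_event_timestamp
    FROM resource r
    JOIN resource_2_stream_event r2se ON r2se.resource_id = r.id
    JOIN stream_event se ON se.id = r2se.stream_event_id
    WHERE se.viewer_id = 57
      AND r.condition_1 IS NOT TRUE
      AND r.condition_2 IS NOT TRUE
      AND r.condition_3 IS NOT TRUE
      AND r.condition_4 IS NOT TRUE
      AND (r.condition_5 IS NULL OR r.condition_6 IS NULL)
    -- Keep only the newest matching rows so the aggregate below
    -- works on at most 1000 rows instead of the full join result.
    ORDER BY se.event_timestamp DESC
    LIMIT 1000
)
SELECT resource_id,
       max(stream_event_event_timestamp) AS latest_event_timestamp
FROM cte
GROUP BY resource_id
ORDER BY latest_event_timestamp DESC
LIMIT 25;
```

If the 1000 rows contain fewer than 25 distinct resource ids, the result comes back short, so the application would need to retry with a larger limit.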

Jonathan Vanasco Over a year ago

I had a decent performance bump from moving the join + WHERE on r outside of the CTE. That brought me down to the 4s to 7s range. I still need to be under 500ms, but that was a boost.
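
The comment does not show the resulting query. One possible reading, offered only as a sketch, keeps the link and event tables inside the CTE and applies the resource join and its conditions afterwards (the alias latest_event_timestamp is an assumption made here):

```sql
WITH cte AS (
    -- Only the viewer filter runs here; resource is not touched yet.
    SELECT r2se.resource_id,
           se.event_timestamp AS stream_event_event_timestamp
    FROM resource_2_stream_event r2se
    JOIN stream_event se ON se.id = r2se.stream_event_id
    WHERE se.viewer_id = 57
)
SELECT cte.resource_id,
       max(cte.stream_event_event_timestamp) AS latest_event_timestamp
FROM cte
-- The join and the resource conditions are applied to the CTE's rows.
JOIN resource r ON r.id = cte.resource_id
WHERE r.condition_1 IS NOT TRUE
  AND r.condition_2 IS NOT TRUE
  AND r.condition_3 IS NOT TRUE
  AND r.condition_4 IS NOT TRUE
  AND (r.condition_5 IS NULL OR r.condition_6 IS NULL)
GROUP BY cte.resource_id
ORDER BY latest_event_timestamp DESC
LIMIT 25;
```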
