I have a rather large dataset (millions of rows). I'm having trouble introducing a "distinct" concept to a certain query. (I putting distinct in quotes, because this could be provided by the posgtres keyword DISTINCT or a "group by" form).
A non-distinct search takes 1ms - 2ms ; all attempts to introduce a "distinct" concept have grown this to the 50,000ms - 90,000ms range.
My goal is to show the latest resources based on their most recent appearance in an event stream.
My non-distinct query is essentially this:
SELECT
resource.id AS resource_id,
stream_event.event_timestamp AS event_timestamp
FROM
resource
JOIN
resource_2_stream_event ON (resource.id = resource_2_stream_event.resource_id)
JOIN
stream_event ON (resource_2_stream_event.stream_event_id = stream_event.id)
WHERE
stream_event.viewer = 47
ORDER BY event_timestamp DESC
LIMIT 25
;
I've tried many different forms of queries (and subqueries) using DISTINCT, GROUP BY and MAX(event_timestamp). The issue isn't getting a query that works, it's getting one that works in a reasonable execution time. Looking at the EXPLAIN ANALYZE output for each one, everything is running off of indexes. Th problem seems to be that with any attempt to deduplicate my results, postges must assemble the entire resultset onto disk; since each table has millions of rows, this becomes a bottleneck.
--
update
here's a working group-by query:
EXPLAIN ANALYZE
SELECT
resource.id AS resource_id,
max(stream_event.event_timestamp) AS stream_event_event_timestamp
FROM
resource
JOIN resource_2_stream_event ON (resource_2_stream_event.resource_id = resource.id)
JOIN stream_event ON stream_event.id = resource_2_stream_event.stream_event_id
WHERE (
(stream_event.viewer_id = 57) AND
(resource.condition_1 IS NOT True) AND
(resource.condition_2 IS NOT True) AND
(resource.condition_3 IS NOT True) AND
(resource.condition_4 IS NOT True) AND
(
(resource.condition_5 IS NULL) OR (resource.condition_6 IS NULL)
)
)
GROUP BY (resource.id)
ORDER BY stream_event_event_timestamp DESC LIMIT 25;
looking at the query planner (via EXPLAIN ANALYZE), it seems that adding in the max+groupby clause (or a distinct) forces a sequential scan. that is taking about half the time to computer. there already is an index that contains every "condition", and i tried creating a set of indexes (one for each element). none work.
in any event, the difference is between 2ms and 72,000ms
EXPLAIN ANALYZE? And how does the full code look with the distinct query you mention below?