Is there a way to efficiently order by a text field in a join query, i.e., fast on large datasets? I understand that I probably need to filter the dataset somehow to reduce its size before sorting, but I haven't yet figured out how to do that. I need the query results in ascending order.
The "values" table (id and text value) has 27,515,151 records, and the "entity_values" table (id, entity_id, value_id, attribute_id) has 31,875,250 records. When I join these tables, it takes 40 seconds to retrieve the first results ordered by value ascending. Without sorting, the query runs quickly, and when I order by descending, it also executes quickly.
I have indexes on the values table (id, value) and on the entity_values table (attribute_id, value_id). Please help me. Here is the query:
SELECT v.value
FROM entity_values ev
JOIN values v ON v.id = ev.value_id
WHERE ev.attribute_id = 5
ORDER BY v.value
LIMIT 10
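For reference, judging from the index names that appear in the plan below, the indexes I mentioned presumably correspond to DDL like this (the exact definitions are my reconstruction and may differ):

```sql
-- Guessed from the question text and the index names in the execution plan;
-- the real definitions on the asker's database may differ.
CREATE INDEX values_value_index ON values (value);
CREATE INDEX index_attributeid_and_valueid ON entity_values (attribute_id, value_id);
```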
Here is the execution plan:
Limit  (cost=1001.02..2149.62 rows=10 width=4) (actual time=67078.933..67079.005 rows=10 loops=1)
  Output: v.value
  Buffers: shared hit=110035024 read=21958968
  ->  Gather Merge  (cost=1001.02..916271399.57 rows=7977313 width=4) (actual time=67078.932..67079.000 rows=10 loops=1)
        Output: v.value
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=110035024 read=21958968
        ->  Nested Loop  (cost=1.00..915349619.69 rows=3323880 width=4) (actual time=39414.285..39414.287 rows=7 loops=3)
              Output: v.value
              Buffers: shared hit=110035024 read=21958968
              Worker 0:  actual time=36252.205..36252.208 rows=10 loops=1
                Buffers: shared hit=33818981 read=6259066
              Worker 1:  actual time=14912.457..14912.459 rows=10 loops=1
                Buffers: shared hit=9003723 read=1604262
              ->  Parallel Index Scan using values_value_index on public.values v  (cost=0.44..8957752.78 rows=11464647 width=12) (actual time=0.836..24843.321 rows=9168174 loops=3)
                    Output: v.value, v.id
                    Buffers: shared hit=16935 read=21958955
                    Worker 0:  actual time=0.428..22411.536 rows=8453050 loops=1
                      Buffers: shared hit=6779 read=6259063
                    Worker 1:  actual time=0.665..10883.462 rows=2250302 loops=1
                      Buffers: shared hit=2513 read=1604259
              ->  Index Only Scan using index_attributeid_and_valueid on public.entity_values ev  (cost=0.56..76.38 rows=268 width=8) (actual time=0.001..0.001 rows=0 loops=27504522)
                    Output: ev.attribute_id, ev.value_id
                    Index Cond: ((ev.attribute_id = 5) AND (ev.value_id = v.id))
                    Heap Fetches: 21
                    Buffers: shared hit=110018089 read=13
                    Worker 0:  actual time=0.001..0.001 rows=0 loops=8453050
                      Buffers: shared hit=33812202 read=3
                    Worker 1:  actual time=0.002..0.002 rows=0 loops=2250302
                      Buffers: shared hit=9001210 read=3
Planning Time: 2.337 ms
Execution Time: 67079.040 ms
Since entity_values with attribute_id = 5 is significantly smaller than the ~32M rows in that table, is it worth creating a subset/CTE of entity_values just where attribute_id = 5, and then matching to that subset?
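A sketch of that suggestion (untested; the MATERIALIZED keyword, available in PostgreSQL 12+, is my addition to force the subset to be computed before the join, so the planner cannot fall back to walking the value index):

```sql
-- Sketch of the commenter's idea: build the attribute_id = 5 subset first,
-- then join only those rows to values and sort the much smaller result.
WITH filtered AS MATERIALIZED (
    SELECT value_id
    FROM entity_values
    WHERE attribute_id = 5
)
SELECT v.value
FROM filtered f
JOIN values v ON v.id = f.value_id
ORDER BY v.value
LIMIT 10;
```

Whether this helps depends on how many rows actually have attribute_id = 5; if the subset is small, sorting it explicitly should beat scanning the whole value index in order.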