
Is there a way to efficiently order by a text field in a join query (i.e., fast on large datasets)? I understand that I probably need to filter the dataset somehow, to reduce the size of the dataset during sorting, but I haven't yet figured out how to do it. I need to output the query results in ascending order.

The "values" table has 27,515,151 records (id and text value), and the "entity_values" table (id, entity_id, value_id, attribute_id) has 31,875,250 records. When I join these tables, it takes 40 seconds to retrieve the first results ordered by value. Without sorting, the query runs quickly, and when I order by value descending, it also executes quickly.

I have indexes on the values table (id, value) and on the entity_values table (attribute_id, value_id). Please help me.

SELECT v.value
FROM entity_values ev
JOIN values v ON v.id = ev.value_id
WHERE ev.attribute_id = 5
ORDER BY v.value
LIMIT 10

Here is the execution plan:

Limit  (cost=1001.02..2149.62 rows=10 width=4) (actual time=67078.933..67079.005 rows=10 loops=1)
  Output: v.value
  Buffers: shared hit=110035024 read=21958968
  ->  Gather Merge  (cost=1001.02..916271399.57 rows=7977313 width=4) (actual time=67078.932..67079.000 rows=10 loops=1)
        Output: v.value
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=110035024 read=21958968
        ->  Nested Loop  (cost=1.00..915349619.69 rows=3323880 width=4) (actual time=39414.285..39414.287 rows=7 loops=3)
              Output: v.value
              Buffers: shared hit=110035024 read=21958968
              Worker 0: actual time=36252.205..36252.208 rows=10 loops=1
                Buffers: shared hit=33818981 read=6259066
              Worker 1: actual time=14912.457..14912.459 rows=10 loops=1
                Buffers: shared hit=9003723 read=1604262
              ->  Parallel Index Scan using values_value_index on public.values v  (cost=0.44..8957752.78 rows=11464647 width=12) (actual time=0.836..24843.321 rows=9168174 loops=3)
                    Output: v.value, v.id
                    Buffers: shared hit=16935 read=21958955
                    Worker 0: actual time=0.428..22411.536 rows=8453050 loops=1
                      Buffers: shared hit=6779 read=6259063
                    Worker 1: actual time=0.665..10883.462 rows=2250302 loops=1
                      Buffers: shared hit=2513 read=1604259
              ->  Index Only Scan using index_attributeid_and_valueid on public.entity_values ev  (cost=0.56..76.38 rows=268 width=8) (actual time=0.001..0.001 rows=0 loops=27504522)
                    Output: ev.attribute_id, ev.value_id
                    Index Cond: ((ev.attribute_id = 5) AND (ev.value_id = v.id))
                    Heap Fetches: 21
                    Buffers: shared hit=110018089 read=13
                    Worker 0: actual time=0.001..0.001 rows=0 loops=8453050
                      Buffers: shared hit=33812202 read=3
                    Worker 1: actual time=0.002..0.002 rows=0 loops=2250302
                      Buffers: shared hit=9001210 read=3
Planning Time: 2.337 ms
Execution Time: 67079.040 ms
  • It sounds like your index is in descending order if that query is fast... just change your index to ascending. Commented Mar 22 at 2:22
  • Could you please use plain text instead of images. And use explain (analyze, verbose, buffers) to get not just the plan but, more importantly, the execution of the plan. Commented Mar 22 at 2:58
  • If the number of records in entity_values with attribute_id = 5 is significantly smaller than the ~32m rows in that table, is it worth creating a subset/CTE of entity_values just where attribute_id = 5, and then match to that subset? Commented Mar 22 at 10:40
  • @FrankHeikens, thank you for your response. I've edited my question. Now you can see the execution plan Commented Mar 22 at 13:53
  • Hi @DaleK . Thank you for your response. No, the index is in the ascending order only. CREATE UNIQUE INDEX values_pkey ON public.values USING btree (id) CREATE INDEX values_id_value_index ON public.values USING btree (id, value) Commented Mar 22 at 13:56

2 Answers


The question is:

Is there a way to efficiently order by a text field in a join query...?

You are joining two tables, where you are filtering (equality) on one of them and sorting on the other one.

A straightforward solution is what PostgreSQL is doing now: reading one table (the driving table) and using a nested loop to read and filter the secondary table. This can work well for small tables but won't scale for large or very large ones.

I can think of two generic solutions to keep high performance on a query like this one:

  1. Multi-table Index. Unfortunately this functionality is not available in PostgreSQL; I've only seen it in Oracle so far. Such an index would be able to filter and sort in a single index scan. Since you are using PostgreSQL, this won't work.

  2. Redundancy. If you add redundancy to the secondary table and index it, you'll be able to query it without even joining the primary one. That is:

create table values (
  id int primary key not null,
  value text,
  attribute_id int -- this is a redundant value copied from entity_values.
);

create index ix1 on values (attribute_id, value);

You'll need to add a couple of triggers to keep attribute_id up to date whenever a row is inserted or updated in the table entity_values. I leave that as an exercise for you.
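As a starting point, a minimal trigger sketch could look like the following. The function name `sync_value_attribute` and trigger name `entity_values_sync` are hypothetical, and it assumes each value_id maps to at most one attribute_id; handling DELETEs and multi-attribute values is left out.

```sql
-- Hypothetical sync trigger: copies attribute_id into values
-- whenever a row in entity_values is inserted or updated.
CREATE OR REPLACE FUNCTION sync_value_attribute() RETURNS trigger AS $$
BEGIN
  UPDATE values
  SET attribute_id = NEW.attribute_id
  WHERE id = NEW.value_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entity_values_sync
AFTER INSERT OR UPDATE OF value_id, attribute_id ON entity_values
FOR EACH ROW EXECUTE FUNCTION sync_value_attribute();
```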

With this redundancy in place, now the query becomes much simpler since it can work on a single table. It should look like:

SELECT value
FROM values
WHERE attribute_id = 5
ORDER BY value
LIMIT 10

Finally, adding redundancy to the database comes with its own problems, so the decision needs to be made carefully. There's always the chance that the values lose their synchronization. Triggers come with their own problems as well; it is typically quite difficult to version them or to keep them in the SCM.


1 Comment

Thank you for your response. I considered this approach but was concerned that it might be an anti-pattern and could lead to unsynchronized data in my database. However, since there are no better options, I'll proceed with this one. I appreciate the advice!

In your execution plan, the nested loop join is the cause of the high time consumption. In many cases, you can replace joins with subqueries to achieve a speedup:

SELECT v.value
FROM values v
WHERE EXISTS (
    SELECT 1 FROM entity_values ev
    WHERE ev.value_id = v.id AND ev.attribute_id = 5
)
ORDER BY v.value
LIMIT 10
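Along the same lines, the pre-filtered subset suggested in the question's comments can be written as a materialized CTE. This is a sketch, not a guaranteed fix: it forces PostgreSQL to collect the matching value_ids before the sort, so it only pays off when relatively few entity_values rows have attribute_id = 5.

```sql
-- Collect the (hopefully small) set of matching value_ids first,
-- then join and sort only that subset.
WITH matched AS MATERIALIZED (
    SELECT value_id
    FROM entity_values
    WHERE attribute_id = 5
)
SELECT v.value
FROM values v
JOIN matched m ON m.value_id = v.id
ORDER BY v.value
LIMIT 10;
```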

1 Comment

Thank you for your response! I just tried your approach, but unfortunately, it didn’t work well for me. The execution time remains the same.
