1

How can I find rows in huge tables which have no foreign key pointing to them efficiently in postgresql?

Let's say we have a table of oranges and orange analysis results. The orange primary key is not controlled by us, and is not in any particular order. Oranges have columns which are relevant for the analysis done by a dedicated programs.

Table oranges (PK over orange_id, btree over created_at):

orange_id | raw_orange_data | created_at
1         | '{ "foo": 5}'   | '2021-08-09 15:00:00'
4092141   | '{ "foo": 42}'  | '2021-08-09 16:00:00'
42        | '{ "foo": 13}'  | '2021-08-09 11:00:00'

Multiple versions of said other program exist, and we want to keep the results for comparability. How should we arrange the foreign keys so that we can select the next oranges which needs processing efficiently?

Table orange_analysis (PK over orange_id, analysis_version):

orange_id | analysis_version| analysis_result
1         | 1               | 9000
1         | 2               | 9001
4092141   | 1               | 50
4092141   | 2               | 60

We are currently considering

SELECT *
FROM oranges
LEFT JOIN orange_analysis ON oranges.id = orange_analysis.orange_id
WHERE (oranges_analysis.orange_id IS NULL AND oranges_analysis.analysis_version = 1)
ORDER BY oranges.created_at DESC
LIMIT 500

or a NOT EXISTS query, but I fear that they are unable to use an index.

Is there a way to organize our tables to ensure such queries run fast? Can it be done without postgres walking over the oranges table? If we didn't have multiple versions I'd use a nullable FK to orange_analysis, but unfortunately we must maintain multiple analysis versions.

The only thing I came up with was to make analysis_result nullable, create it with null for all oranges for all analysis variants, put an index over it, and set the analysis_result column to its respective value once analysis progresses.

2
  • Please explain this: "How should we arrange the foreign keys so that we can select the next oranges which needs processing efficiently?" How can the data help solve this question? What is "processing efficiently"? What are "next oranges"? And how are these represented in the data? Commented Aug 26, 2021 at 11:27
  • We have billions of oranges, and the analyzing programs need to fetch a batch of oranges they have to process next. The "next oranges" are the newest oranges that have no entry in the orange_analysis table with the corresponding analyzer version. Commented Aug 26, 2021 at 11:33

2 Answers 2

2

If you are looking for oranges that are missing a particular version, then you can use:

SELECT o.*
FROM oranges o LEFT JOIN
     orange_analysis oa
     ON o.id = oa.orange_id AND oa.analysis_version = 1
WHERE oa.orange_id IS NULL 
ORDER BY o.created_at DESC
LIMIT 500;

For performance, though, I might suggest NOT EXISTS with particular indexes:

SELECT o.*
FROM oranges o 
WHERE NOT EXISTS (SELECT 1
                  FROM orange_analysis oa
                  WHERE o.id = oa.orange_id AND oa.analysis_version = 1
                 )
ORDER BY o.created_at DESC
LIMIT 500;

The indexes you want are oranges(created_at desc, id) and orange_analysis(orange_id, analysis_version). These are also appropriate for the LEFT JOIN version, but I'm not sure if they will avoid sorting for the ORDER BY.

Sign up to request clarification or add additional context in comments.

5 Comments

I tried your WHERE NOT EXISTS query, but it takes several seconds on a table with 5 million oranges. postgres' explain lists a cheap "Index Only Scan" over orange_analysis, an expensive "Index Scan Backward" over oranges (0.43..731360.49 rows=5240849), after that a "Nested Loop Anti Join" (0.86..9307676.19 rows=2524071), and finally the limit (0.045..8141.774). Without the ORDER BY the query is fast though, would you know anything that makes it bearable with the ORDER BY?
@Benni . . . I'm not sure why the index scan would be backwards. In any case, it should be scanning oranges until it hits 50 matches to the where clause. If it has to go through the whole table, that would be expensive.
I should note that the index scan backward did not use the new oranges(created_at desc, id) index, but another btree index over created_at only. If this query does a full data walk I fear it is not suitable for this use case, because there will be just a few (new and urgent) oranges at the beginning, and then it has to skip millions of oranges to get to the backlog oranges at the end. Worst case I can split my problem into two parts, constrain the ordered oranges to a small time range, and process the backlog without enforced ordering, but is there a way around it?
Would shenanigans like maintaining an int[] column in oranges that denotes which analysises have been done work? If there are GIN indexes over that column and created_at, can postgres grab the most urgent rows that do not have a certain analysis_version value in the array? My experience with GIN indexes is non-existent, I'm afraid, so I cannot estimate the viability and potential future problems.
We have fallen back to use a dedicated table for tracking pending oranges, which we insert with an "on insert" trigger. Thanks for your input though, I guess your proposal is as fast as its gonna get.
0

I think that a plain set operation is often the most efficient way:

SELECT id FROM oranges
EXCEPT
SELECT orange_id FROM orange_analysis
   WHERE analysis_version = 1;

That will output only those rows from the first query that do not occur as results of the second query (set difference).

4 Comments

Without proper indexes this takes ages, unfortunately
That is also possible - it depends on the row counts. An index on the WHERE condition will be beneficial in all cases.
No, only on the cases where a long data walk on oranges can be avoided
Of course, but you didn't provide an execution plan, so we can only guess.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.