How can I efficiently find rows in huge tables that have no foreign key pointing to them in PostgreSQL?
Let's say we have a table of oranges and a table of orange analysis results. The orange primary key is not controlled by us and is not in any particular order. Oranges have columns that are relevant for the analysis, which is done by a dedicated program.
Table oranges (PK over orange_id, btree over created_at):
orange_id | raw_orange_data | created_at
1 | '{ "foo": 5}' | '2021-08-09 15:00:00'
4092141 | '{ "foo": 42}' | '2021-08-09 16:00:00'
42 | '{ "foo": 13}' | '2021-08-09 11:00:00'
Multiple versions of that analysis program exist, and we want to keep the results of each version for comparability. How should we arrange the foreign keys so that we can efficiently select the next oranges that need processing?
Table orange_analysis (PK over orange_id, analysis_version):
orange_id | analysis_version | analysis_result
1 | 1 | 9000
1 | 2 | 9001
4092141 | 1 | 50
4092141 | 2 | 60
We are currently considering
SELECT oranges.*
FROM oranges
LEFT JOIN orange_analysis
  ON oranges.orange_id = orange_analysis.orange_id
  AND orange_analysis.analysis_version = 1  -- version filter must be part of the join condition
WHERE orange_analysis.orange_id IS NULL     -- keep only oranges without a version-1 result
ORDER BY oranges.created_at DESC
LIMIT 500;
or a NOT EXISTS query, but I fear that they are unable to use an index.
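For reference, the NOT EXISTS variant might look like the sketch below. It uses only the table and column names shown above; the aliases `o` and `a` and the hard-coded version `1` are just for illustration.

```sql
-- Sketch: oranges that have no version-1 analysis row yet, newest first.
SELECT o.*
FROM oranges AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM orange_analysis AS a
    WHERE a.orange_id = o.orange_id
      AND a.analysis_version = 1   -- the analyzer version being checked
)
ORDER BY o.created_at DESC
LIMIT 500;
```

In principle the primary key on (orange_id, analysis_version) could serve the inner lookup and the btree on created_at could supply the ordering, but whether the planner actually does that would have to be checked with EXPLAIN (ANALYZE, BUFFERS) on the real data.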
Is there a way to organize our tables so that such queries run fast? Can it be done without PostgreSQL scanning the entire oranges table? If we didn't have multiple versions, I'd use a nullable FK to orange_analysis, but unfortunately we must maintain multiple analysis versions.
The only thing I came up with was to make analysis_result nullable, insert a row with a NULL result for every orange and every analysis version, put an index on that column, and set analysis_result to its actual value once the analysis has run.
orange_analysis table with the corresponding analyzer version.
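To make the placeholder idea concrete, here is a rough sketch. It assumes analysis_result has been made nullable and has no default, and it uses a partial index instead of a plain one so that only unprocessed rows are indexed. The index name `orange_analysis_pending_idx` is made up, and ON CONFLICT needs PostgreSQL 9.5 or later.

```sql
-- Hypothetical partial index that contains only rows still awaiting analysis.
CREATE INDEX orange_analysis_pending_idx
    ON orange_analysis (analysis_version, orange_id)
    WHERE analysis_result IS NULL;

-- Seed placeholder rows for analyzer version 1.
-- analysis_result is omitted, so it is stored as NULL (assumes no column default).
INSERT INTO orange_analysis (orange_id, analysis_version)
SELECT orange_id, 1
FROM oranges
ON CONFLICT (orange_id, analysis_version) DO NOTHING;

-- Fetch the next batch of oranges that still need version-1 analysis.
SELECT o.*
FROM orange_analysis AS a
JOIN oranges AS o ON o.orange_id = a.orange_id
WHERE a.analysis_version = 1
  AND a.analysis_result IS NULL
ORDER BY o.created_at DESC
LIMIT 500;
```

The partial index shrinks as results are filled in, so finding pending rows should stay cheap. The ORDER BY on created_at still needs the join back to oranges, though, so I am not sure this avoids a sort when many rows are pending.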