3

Assume we have 4 tables A, B, C, D in some unspecified relational SQL database. A references B, also C and D as well. References means that A has column A.refX_id = X.id, X is A, B and C (common foreign key 1:N).

What I want, is to query table A with condition based on columns from all child tables B, C and D. My question is: What of the following variants is generally better? (In terms of usability, efficiency, speed.)

Variant 1:

SELECT DISTINCT A.* FROM A
   JOIN B ON A.refB_id = B.id
   JOIN C ON A.refC_id = C.id
   JOIN D ON A.refD_id = D.id
   WHERE <condition on B> AND <condition on C> AND <condition on D>;

Which I like more from perspective of a database, but looks a little bit harder to program.

Variant 2:

SELECT id FROM B WHERE <condition on B>; # result store to array "BIds" on program side
SELECT id FROM C WHERE <condition on C>; # result store to array "CIds" on program side
SELECT id FROM D WHERE <condition on D>; # result store to array "DIds" on program side

SELECT A.* FROM A
   WHERE refB_id IN (<B_ids>) AND refC_id IN (<C_ids>) AND refD_id IN (<D_ids>);

# <B_ids> menas expand whole array of ids, which can result in a very long query string

I thought that Variant 2 is completelly smut and unusable with potentially large data. But I've heard, that a lot of frameworks normally use it, because it is relativelly straightforward. Is it sort of legal way to query data like this in general case, if I know that the content of the "IN" clause is taken from a result of another query(ies)?

1

2 Answers 2

4

I would encourage you to use IN or EXISTS:

SELECT A.*
FROM A
WHERE EXISTS (SELECT 1 FROM B WHERE A.refB_id = B.id AND <condition on B>) AND
      EXISTS (SELECT 1 FROM C WHERE A.refC_id = C.id AND <condition on C>) AND
      EXISTS (SELECT 1 FROM D WHERE A.refD_id = D.id AND <condition on D>);

Advantages of this approach:

  • No danger of getting large intermediate Cartesian products.
  • No expense of eliminating duplicates for SELECT DISTINCT.
  • Most databases handle EXISTS pretty well.
  • You can optimize each subquery using indexes.

EDIT:

You can write this with IN and subqueries:

SELECT A.*
FROM A
WHERE A.refB_id IN (SELECT B.id FROM B WHERE <condition on B>) AND
      A.refC_id IN (SELECT C.id FROM C WHERE <condition on C>) AND
      A.refD_id IN (SELECT D.id FROM D WHERE <condition on D>);
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks a lot! Have never seen SELECT 1... How would it looks like with IN? Did you mean that Variant1??
Yes, this is more readable. Thank you, that is exactly what I wanted.
@David I disagree about readability but it depends on what you are used to seeing.
3

Not sure which framework uses the 2nd approach but first approach is what I will go for and moreover everyone else will. If you have proper index created on the join column for all tables then 1st approach will produce a much better plan than the 2nd one cause you have multiple IN clause and what if every IN has to work on millions of elements???.

Also, I would change the INNER JOIN to LEFT JOIN assuming that not all ID's may match and move the WHERE conditions to JOIN ON conditions like

SELECT DISTINCT A.* FROM A
   LEFT JOIN B ON A.refB_id = B.id AND <condition on B>
   LEFT JOIN C ON A.refC_id = C.id AND <condition on C>
   LEFT JOIN D ON A.refD_id = D.id AND <condition on D>;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.