Consider the query below:

Select count(*) FROM
(Select distinct Student_ID, Name, Student_Age, CourseID from student) a1
JOIN
(Select distinct CourseID, CourseName, TeacherID from courses) a2
ON a1.CourseID=a2.CourseID
JOIN 
(Select distinct TeacherID, TeacherName, Teacher_Age from teachers) a3
ON a2.TeacherID=a3.TeacherID

The subqueries are required for deduplication.

This query runs fine in PostgreSQL. However, if I add a condition between the student and teacher tables, the execution plan shows that Postgres wrongly uses a nested loop join between the student and teacher tables, which have no direct relationship. For example:

Select count(*) FROM
(Select distinct Student_ID, Name, Student_Age, CourseID from student) a1
JOIN
(Select distinct CourseID, CourseName, TeacherID from courses) a2
ON a1.CourseID=a2.CourseID
JOIN 
(Select distinct TeacherID, TeacherName, Teacher_Age from teachers) a3 ON
 a2.TeacherID=a3.TeacherID
WHERE Teacher_Age>=Student_Age

This query takes forever to run. However, if I replace the subqueries with the tables themselves, it runs very fast. Without using temp tables to store the deduplicated results, is there a way to avoid the nested loop in this situation?

Thank you for your help.

  • Why do you want to join the inline queries when you can join the tables themselves? Consider doing a LEFT JOIN instead and putting the condition in the join clause instead of the WHERE clause. Commented Oct 7, 2014 at 19:09
  • The subqueries must be used for deduplication. We have a lot of that in our dataset. Also each table above contains about 3M records. Commented Oct 7, 2014 at 19:11
  • If you have a lot of duplication in the student, teachers and courses tables, it sounds like a flaw in your schema. The unique identifying attributes should be in one table and whatever data is related to them and causing the duplicates when you select that identifying data should be in one or more other tables. Commented Oct 8, 2014 at 3:59
  • Can we see your plans with the buffers option? Commented Oct 8, 2014 at 6:25
  • @gwaigh: I can't do anything to change the schema or the data. These tables are the result of multi-site data integration. At each site, each student is unique. However, since data about the same student can be stored at multiple sites, duplication occurred when the data were integrated. Commented Oct 8, 2014 at 11:42

1 Answer

You're making the database perform a lot of unnecessary work to accomplish your goal. Instead of doing 3 different SELECT DISTINCT sub-queries all joined together, try joining the base tables directly to each other and let the database handle the DISTINCT part only once. If your tables have proper indexes on the ID fields, this should run rather quickly.

SELECT COUNT(1)
    FROM (
    SELECT DISTINCT s.Student_ID, c.CourseID, t.TeacherID
        FROM student s
        JOIN courses c ON s.CourseID = c.CourseID
        JOIN teachers t ON c.TeacherID = t.TeacherID
        WHERE t.Teacher_Age >= s.Student_Age
     ) a
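
Before swapping queries, it may be worth checking that the single-DISTINCT form returns the same count as the three-subquery form on your data. Below is a minimal sketch of such a check. It uses an in-memory SQLite database in place of PostgreSQL, and the sample rows are made up for illustration; both are assumptions, not part of the question. It only compares results and says nothing about the PostgreSQL planner.

```python
import sqlite3

# Hypothetical miniature dataset (not from the question): some rows are
# exact duplicates, as they might be after a multi-site integration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student  (Student_ID INT, Name TEXT, Student_Age INT, CourseID INT);
CREATE TABLE courses  (CourseID INT, CourseName TEXT, TeacherID INT);
CREATE TABLE teachers (TeacherID INT, TeacherName TEXT, Teacher_Age INT);

INSERT INTO student VALUES
  (1, 'Ann', 20, 10),
  (1, 'Ann', 20, 10),   -- exact duplicate
  (2, 'Bob', 45, 10),
  (3, 'Cid', 19, 11);
INSERT INTO courses VALUES
  (10, 'Math', 100),
  (10, 'Math', 100),    -- exact duplicate
  (11, 'Art', 101);
INSERT INTO teachers VALUES
  (100, 'Ms. X', 40),
  (100, 'Ms. X', 40),   -- exact duplicate
  (101, 'Mr. Y', 30);
""")

# The question's form: dedupe each table first, then join and filter.
ORIGINAL = """
SELECT count(*) FROM
(SELECT DISTINCT Student_ID, Name, Student_Age, CourseID FROM student) a1
JOIN
(SELECT DISTINCT CourseID, CourseName, TeacherID FROM courses) a2
ON a1.CourseID = a2.CourseID
JOIN
(SELECT DISTINCT TeacherID, TeacherName, Teacher_Age FROM teachers) a3
ON a2.TeacherID = a3.TeacherID
WHERE Teacher_Age >= Student_Age
"""

# The answer's form: join the base tables, filter, then dedupe once on the IDs.
REWRITTEN = """
SELECT COUNT(1)
FROM (
    SELECT DISTINCT s.Student_ID, c.CourseID, t.TeacherID
    FROM student s
    JOIN courses c ON s.CourseID = c.CourseID
    JOIN teachers t ON c.TeacherID = t.TeacherID
    WHERE t.Teacher_Age >= s.Student_Age
) a
"""

original_count = conn.execute(ORIGINAL).fetchone()[0]
rewritten_count = conn.execute(REWRITTEN).fetchone()[0]
print(original_count, rewritten_count)
```

On this sample the two counts agree because every duplicate is an exact copy of another row. If the same Student_ID can appear with a different Name or Student_Age at different sites, the two queries can disagree, since the rewritten one dedupes on the ID columns only; in that case the extra columns would need to be added to its DISTINCT list.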