
I've spent a lot of time optimizing this query but it's starting to slow down with larger tables. I imagine these are probably the worst types of questions but I'm looking for some guidance. I'm not really at liberty to disclose the database schema so hopefully this is enough information. Thanks,

SELECT tblA.id, tblB.id, tblC.id, tblD.id
FROM tblA, tblB, tblC, tblD
INNER JOIN (SELECT max(tblB.id) AS xid
                FROM tblB
                WHERE tblB.rdd = 11305
                GROUP BY tblB.index_id
                ORDER BY NULL) AS rddx
           ON tblB.id = rddx.xid
WHERE
    tblA.id = tblB.index_id
    AND tblC.name = tblD.s_type
    AND tblD.name = tblA.s_name
GROUP BY tblA.s_name
ORDER BY NULL;

There is a one-to-many relationship between:

  • tblA.id and tblB.index_id
  • tblC.name and tblD.s_type
  • tblD.name and tblA.s_name
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
| id | select_type | table      | type   | possible_keys | key       | key_len | ref                          | rows  | Extra                        |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL      | NULL    | NULL                         | 32568 | Using temporary              |
|  1 | PRIMARY     | tblB       | eq_ref | PRIMARY       | PRIMARY   | 8       | rddx.xid                     |     1 |                              |
|  1 | PRIMARY     | tblA       | eq_ref | PRIMARY       | PRIMARY   | 8       | tblB.index_id                |     1 | Using where                  |
|  1 | PRIMARY     | tblD       | eq_ref | PRIMARY       | PRIMARY   | 22      | tblA.s_name                  |     1 | Using where                  |
|  1 | PRIMARY     | tblC       | eq_ref | PRIMARY       | PRIMARY   | 22      | tblD.s_type                  |     1 |                              |
|  2 | DERIVED     | tblB       | ref    | rdd_idx       | rdd_idx   | 7       |                              | 65722 | Using where; Using temporary |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
  • What database engine are the tables using? Commented Nov 2, 2011 at 8:41
  • Please can you add some more information. For example, how many records satisfy the condition WHERE tblB.rdd = 11305 - is it really 65722? Removal of the temp table that gets generated will help the query, but it is hard to really tell as we don't know what is in these tables. Commented Nov 2, 2011 at 8:43
  • Thanks for the replies. The tables are InnoDB; tblB has 41,633 entries with rdd = 11305. Commented Nov 2, 2011 at 20:41
  • Do you have an index on (rdd, index_id) in tblB ? Commented Nov 3, 2011 at 0:42

2 Answers


Unless I've misunderstood the information you've provided, I believe you could rewrite the above query as follows:

EXPLAIN SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
-- one row per tblA entry; without a GROUP BY, MAX() collapses everything into a single row
GROUP BY tblA.id
ORDER BY NULL;

I can't provide the EXPLAIN output for this myself, since it depends on the data in your database. It would be interesting to see the EXPLAIN for this query.

EXPLAIN only gives you an estimate of what will happen. You can use SHOW SESSION STATUS to see in detail what actually happened when you ran a query. Make sure to run FLUSH STATUS before you run the query that you are investigating, so that the counters start clean. So in this case you would run:

FLUSH STATUS;

-- Run the query itself (no EXPLAIN prefix) so the Handler counters reflect real execution
SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
GROUP BY tblA.id
ORDER BY NULL;

SHOW SESSION STATUS LIKE 'Handler%';

This gives you a number of indicators to show what actually happened when a query executed.

Handler_read_rnd_next - Number of requests to read next row in the data file
Handler_read_key - Number of requests to read a row based on a key
Handler_read_next - Number of requests to read the next row in key order

Using these values you can see what actually went on under the hood: how many rows were fetched through an index lookup and how many were read by scanning.
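
If the full list of handler counters is too noisy, the three counters described above can be pulled out on their own. This is only a sketch: it uses the WHERE form of SHOW STATUS, and it assumes you run everything in the same session, with the query under test in the middle.

```sql
-- Reset the session counters before the measurement.
FLUSH STATUS;

-- ... run the query you are investigating here (the real query, not EXPLAIN) ...

-- Show only the three counters discussed above.
SHOW SESSION STATUS
WHERE Variable_name IN ('Handler_read_rnd_next',
                        'Handler_read_key',
                        'Handler_read_next');
```

As a rough guide, a Handler_read_rnd_next value that is large relative to Handler_read_key usually points to table scans rather than index lookups.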

Unfortunately, without knowing the data in the tables, the engine type and the data types used in the queries, it is quite hard to advise on how you could optimize further.


1

I have rewritten the query using explicit JOINs instead of joining within the WHERE clause. That way, as a developer, you can directly see the relationships between the tables: A->B, A->D and D->C. Getting the highest tblB ID for the common "ID = Index_ID" with RDD = 11305 doesn't require a complete sub-query; instead, the MAX() moves up into the field selection clause. I would ensure you have an index on tblB on (index_id, rdd). Finally, STRAIGHT_JOIN helps enforce that the tables are joined in the order they are listed.

-- EDIT FROM COMMENT --

It appears you are getting NULLs from tblB. This typically indicates a valid tblA record, but no tblB record with the same ID that has RDD = 11305. That said, it appears you are only concerned with the entries associated with 11305, so I'm adjusting the query accordingly. Please make sure you have an index on tblB based on the "RDD" column (at least in the first position, if it is a multi-column index).
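
As a sketch of what such an index could look like: the index name idx_rdd_index_id is made up for illustration, and this assumes tblB.id is the InnoDB primary key, as the EXPLAIN in the question suggests.

```sql
-- Hypothetical index name. Column order puts the equality filter (rdd) first,
-- then the grouping column (index_id).
ALTER TABLE tblB ADD INDEX idx_rdd_index_id (rdd, index_id);

-- Check whether the per-index MAX(id) lookup can now be answered from the index.
-- InnoDB secondary indexes also carry the primary key, so MAX(id) may not need
-- the row data; look for "Using index" in the Extra column.
EXPLAIN
SELECT index_id, MAX(id) AS HighestPerIndexID
FROM tblB
WHERE rdd = 11305
GROUP BY index_id;
```

The question's EXPLAIN shows an existing rdd_idx already being used for the filter, so whether the wider index actually helps is something to confirm against your own data.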

As you can see in this one, I'm pre-querying from table B only for 11305 entries and pre-grouping by the index_ID (as linked to tblA). This gives me one record per index where they will exist... From THIS result, I'm joining back to A, then directly back to B again, but based on that highest match ID found, then D and C as was before. So NOW, you can get any column from any of the tables and get proper record in question... There should be no NULL values left in this query.

Hopefully, I've clarified HOW I'm getting the pieces together for you.

SELECT STRAIGHT_JOIN
      PreQuery.HighestPerIndexID,
      tblA.id,
      tblA.AnotherAField,
      tblA.Etc,
      tblB.SomeOtherField,
      tblB.AnotherField,
      tblC.id,
      tblD.id
   FROM
      -- highest tblB.ID per Index_ID among the RDD = 11305 rows
      ( SELECT PQ1.Index_ID,
               MAX( PQ1.ID ) AS HighestPerIndexID
           FROM tblB PQ1
           WHERE PQ1.RDD = 11305
           GROUP BY PQ1.Index_ID ) PreQuery

         JOIN tblA
            ON PreQuery.Index_ID = tblA.ID

         -- back to tblB, but only the single row with the highest ID
         JOIN tblB
            ON PreQuery.HighestPerIndexID = tblB.ID

         JOIN tblD
            ON tblA.s_Name = tblD.name

         JOIN tblC
            ON tblD.s_type = tblC.Name
   ORDER BY
      tblA.s_Name;

5 Comments

Thank you for the help. This suggestion didn't quite work for me but has me thinking in a new direction. I think you already understand, but let me clarify a little: There is a one-to-many relationship between tblA and tblB where for a given entry in tblA, there will be n associated entries in tblB, and m of them will have a specific 'rdd'. The purpose of the MAX(tblB.id) select clause is to obtain the most recent member of this relationship. This query seems to work, but I don't think it selects the entry from tblB with the highest 'id' for use in the join. Hope that makes sense.
@Doug, it SHOULD represent properly as the join will pull one record from every tblA for every record in tblB. The group by of all non-aggregates should ensure only the highest entry. Now, the problem. If you are trying to get other columns from table B, then MySQL is just grabbing the first instance it hits of the other columns which might be what you are encountering... if so, I can adjust the query to reflect what you REALLY are looking for.
@DRapp- An adjustment would be greatly appreciated then ;). I tried to slim down the query quite a bit but I am indeed getting other columns from tblB; they show up as NULL when I use this method.
@Doug, revised answer and more clarifications.
@DRapp- Thanks again. This seems much more like what I was looking for; There are two derived tables that say 'using temporary; using filesort' but the query does seem to be faster.
