
One for all you MySQL experts :-)

I have the following query:

SELECT o.*, p.name, p.amount, p.quantity 
FROM orders o, products p 
WHERE o.id = p.order_id AND o.total != '0.00' AND DATE(o.timestamp) BETWEEN '2012-01-01' AND '2012-01-31' 
ORDER BY o.timestamp ASC
  • orders table = 80,900 rows
  • products table = 125,389 rows
  • o.id and p.order_id are indexed

The query takes about 6 seconds to complete - which is way too long. I am looking for a way to optimize it, possibly with temporary tables or a different type of join. I'm afraid my understanding of both of these concepts is pretty limited.

Can anyone suggest a way for me to optimize this query?

2
  • Do you have o.total values less than 0? If not, it would be better to replace the not-equal operator (!=) with greater-than (>). Commented Oct 16, 2012 at 11:23
  • Unfortunately I do, so I'm afraid I'm forced to stick with the != operator! Commented Oct 17, 2012 at 8:47

4 Answers

2

First, I would use a different style of syntax. ANSI-92 has had 20 years to bed in, and many RDBMS actually recommend not using the notation you have used. It's not going to make a difference in this case, but it really is very good practice for a host of reasons (that I'll let you investigate and make a decision on yourself).

Final answer, and example syntax:

SELECT
  o.*, p.name, p.amount, p.quantity
FROM
  orders o
INNER JOIN
  products p
    ON o.id = p.order_id
WHERE
      o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-02-01'
  AND o.total     != '0.00'
ORDER BY
  o.timestamp ASC

As the orders table is the one you are making the initial filtering on, that's a very good place to start looking at optimisation.


With DATE(o.timestamp) BETWEEN x AND y you do succeed in getting every date and time in January. But that requires calling the DATE() function on every single row in the orders table (the row-by-row processing often nicknamed RBAR, "row by agonizing row"). The RDBMS can't see through the function to work out which rows it could skip. Instead we have to do that optimisation ourselves, by rearranging the comparison so that no function is applied to the field we are filtering.

    orders.timestamp >= '2012-01-01'
AND orders.timestamp <  '2012-02-01'

This version allows the optimiser to know that you want a block of dates that are all sequential with each other. It's called a range-seek. It can use an index to very quickly find the first record and last record that fit that range, then pick out every record in between. That avoids checking all the records that don't fit, and even avoids checking all the records in the middle of the range; only the boundaries need to be sought out.
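As a small aside, the upper boundary can also be derived from the lower one, so only the first day of the month has to be supplied. This is just a sketch, assuming MySQL's `+ INTERVAL` date arithmetic and the `o` alias from the question. The arithmetic is applied to the constant, not to the column, so the range-seek described above should still be possible.

```sql
-- Half-open range covering one calendar month.
-- The column is compared as-is; only the constant side is computed.
SELECT o.id, o.timestamp
FROM orders AS o
WHERE o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-01-01' + INTERVAL 1 MONTH  -- upper bound is 2012-02-01
ORDER BY o.timestamp ASC;
```

Because the upper bound is exclusive, a row stamped '2012-01-31 23:59:59' is included and a row stamped exactly '2012-02-01 00:00:00' is not.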

That assumes all the records are ordered by date, and that the optimiser can see that. To do so you need an index. With that in mind there seem to be two basic composite indexes that you could use:
- (id, timestamp)
- (timestamp, id)

The first is what I see people use the most. But that forces the optimiser to do the timestamp range-seek for each id separately. And since every id likely has a different timestamp value, you've gained nothing.

The second index is what I recommend.
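Creating that index might look like the sketch below. The index name is made up for illustration, and the statement assumes the column names shown in the question.

```sql
-- Hypothetical index name; choose whatever fits your naming scheme.
-- timestamp is quoted with backticks because it is also a type keyword.
CREATE INDEX idx_orders_timestamp_id ON orders (`timestamp`, id);

-- List the indexes on the table to confirm it was created.
SHOW INDEX FROM orders;
```

If orders is an InnoDB table and id is its primary key, an index on (`timestamp`) alone should behave much the same, since InnoDB secondary indexes carry the primary key columns with them.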

Now, the optimiser can fulfil this part of your query exceptionally quickly...

SELECT
  o.*
FROM
  orders o
WHERE
      o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-02-01'
ORDER BY
  o.timestamp ASC

As it happens, even the ORDER BY has been optimised with the suggested index. It's already in the order that you want the data to be output. There is no need to re-sort everything after the join.


Then, to fulfil the total != '0.00' requirement, every row in your range is still checked. But you've already narrowed the range down so much that this will probably be fine. (I won't go into it, but you will likely find it impossible to use indexes in MySQL to optimise both this and the timestamp range-seek.)

Then, you have your join. That's optimised by an index you already have (products.order_id). For every record picked out by the snippet above, the optimiser can do an index seek and very quickly identify the matching record(s).


This all assumes that, in the vast majority of cases, every order row has one or more product rows. If, for example, only a very select few orders had any product rows, it may be faster to pick out the product rows of interest first; essentially looking at the joins happening in reverse order.

The optimiser actually makes that decision for you, but it's handy to know that it's doing that, then provide the indexes you estimate will be most useful to it.

You can check the EXPLAIN plan to see if the indexes are being used. If not, your attempt to help was ignored, probably because the statistics of the data implied that a different order of joining was better. If so, you can then provide indexes to help that order of joins instead.
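A minimal sketch of that check, using the rewritten query from this answer. What the plan actually shows depends on your data, statistics and MySQL version, so treat the notes in the comments as things to look for rather than guaranteed output.

```sql
EXPLAIN
SELECT o.*, p.name, p.amount, p.quantity
FROM orders AS o
INNER JOIN products AS p
  ON o.id = p.order_id
WHERE o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-02-01'
  AND o.total <> '0.00'
ORDER BY o.timestamp ASC;

-- Things to look for in the result:
--   key   : which index (if any) was chosen for each table
--   type  : "range" on orders suggests the timestamp range-seek is in use,
--           "ALL" means a full table scan
--   rows  : the optimiser's estimate of rows examined per table
--   Extra : "Using filesort" suggests the ORDER BY was not satisfied by an index
```

Running the same EXPLAIN against the original DATE(o.timestamp) BETWEEN ... version gives a before-and-after comparison.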


8 Comments

This is a fantastic answer - thanks for explaining so thoroughly! The only part I'm a little confused on is adding an index to the timestamp field. I always thought this field was not unique - and therefore incapable of being indexed. Is my understanding incorrect?
@dai.hop - No, a field does not need to be unique to be indexed. It is often the default that the Primary Key is indexed (either as a Clustered Index or a Non-Clustered Index), but non-unique fields can be indexed too.
Excellent. I've applied the index as suggested, the query now runs in 1.9 seconds - a significant improvement!
@dai.hop - Not as much as I'd have expected though. Did you make the other changes too? (To the query itself?)
Yes, I ran the final answer after indexing the timestamp field. My index information now looks like this [screenshot](dropbox.com/s/yhtos9fyx2phjey/Screen Shot 2012-10-17 at 14.06.48.png)
2
  1. Use EXPLAIN to see how the query is executed and where it can be optimised. I'd suggest starting with indexes on total and timestamp.

  2. You may find removing the date function improves performance.

  3. You should use modern syntax.

eg.

SELECT o.*, p.name, p.amount, p.quantity  
FROM orders o
     inner join products p  
     on o.id = p.order_id 
WHERE o.total != '0.00' 
AND o.timestamp BETWEEN '2012-01-01' AND '2012-01-31 23:59'  
ORDER BY o.timestamp ASC 

2 Comments

I think you should explain why removing the DATE() call may help. And I disagree with BETWEEN '2012-01-01' AND '2012-01-31 23:59'. What about '2012-01-31 23:59:59'? Or '2012-01-31 23:59:59.99'? Don't use BETWEEN on datetime ranges, instead use o.timestamp >= '2012-01-01' AND o.timestamp < '2012-02-01'.
+1 I agree with @Dems that BETWEEN should be used with care, but can be used even to compare dates. But +1 is for explain, that is crucial in solving performance issues.
2

I'm not a MySQL expert (more SQL Server), but I think you'd be better off with an index on o.timestamp, and you need to rewrite your query like this:

o.timestamp >= '2012-01-01' and o.timestamp < '2012-01-31' + INTERVAL 1 DAY

The logic is that the index will not be used if you compare an expression on the column with constants. You need to compare the bare column with constants.

4 Comments

Except that this specific answer changes the functionality. What about a timestamp of '2012-01-31 23:59:59.99'? Don't use BETWEEN at all, instead use... o.timestamp >= '2012-01-01' AND o.timestamp < '2012-02-01'.
Yes, you're right: you need to get two constants, cut off the time, and then add one day to the ending date. The answer was more about the idea. I'll change the answer.
There is no DATEADD() in MySQL. It's + INTERVAL 1 DAY.
:) Oops, sorry again. The mistake of applying a function to a column is very frequent, and I just wanted to show that it should be changed.
1

SELECT *:

Selecting all columns with the * wildcard will cause the query's meaning and behavior to change if the table's schema changes, and might cause the query to retrieve too much data.

The != operator is non-standard:

Use the <> operator to test for inequality instead.

Aliasing without the AS keyword: Explicitly using the AS keyword in column or table aliases, such as "tbl AS alias," is more readable than implicit aliases such as "tbl alias".
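Applied together, those three points might look like the sketch below. Only id, timestamp and total are known columns of orders from the question, so the column list here is an assumption; name whichever columns you actually need.

```sql
SELECT
  o.id,           -- explicit columns instead of o.*
  o.timestamp,
  o.total,
  p.name,
  p.amount,
  p.quantity
FROM orders AS o                -- explicit AS for table aliases
INNER JOIN products AS p
  ON o.id = p.order_id
WHERE o.total <> '0.00'         -- standard inequality operator
  AND o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-02-01'
ORDER BY o.timestamp ASC;
```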

2 Comments

Good pointers for best practices, but completely unrelated to performance in this case.
Thanks for the tip. I will keep this in mind when writing queries from now on.
