MySQL Query Optimisation - JOIN?

Question

One for all you MySQL experts :-)

I have the following query:

SELECT o.*, p.name, p.amount, p.quantity 
FROM orders o, products p 
WHERE o.id = p.order_id AND o.total != '0.00' AND DATE(o.timestamp) BETWEEN '2012-01-01' AND '2012-01-31' 
ORDER BY o.timestamp ASC

orders table = 80,900 rows
products table = 125,389 rows
o.id and p.order_id are indexed

The query takes about 6 seconds to complete - which is way too long. I am looking for a way to optimize it, possibly with temporary tables or a different type of join. I'm afraid my understanding of both of these concepts is pretty limited.

Can anyone suggest a way for me to optimize this query?

Do you have o.total values less than 0? If not, it would be better to replace the not-equal operator (!=) with greater-than (>).
– niktrs, Commented Oct 16, 2012 at 11:23
Unfortunately I do, so I'm afraid I'm forced to stick with the != operator!
– dai.hop, Commented Oct 17, 2012 at 8:47

MatBailie · Accepted Answer · 2012-10-16 11:27:53Z

2

First, I would use a different style of syntax. ANSI-92 has had 20 years to bed in, and many RDBMS actually recommend not using the notation you have used. It's not going to make a difference in this case, but it really is very good practice for a host of reasons (that I'll let you investigate and make a decision on yourself).

Final answer, and example syntax:

SELECT
  orders.*, products.name, products.amount, products.quantity
FROM
  orders
INNER JOIN
  products
    ON orders.id = products.order_id
WHERE
      orders.timestamp >= '2012-01-01'
  AND orders.timestamp <  '2012-02-01'
  AND orders.total     != '0.00'
ORDER BY
  orders.timestamp ASC

As the orders table is the one you are making the initial filtering on, that's a very good place to start looking at optimisation.

With DATE(o.timestamp) BETWEEN x AND y you do get every date and time in January. But it requires calling the DATE() function on every single row in the orders table (row-by-row processing, sometimes called RBAR, "row by agonizing row"). The RDBMS can't see through the function, so it can't use an index on the column to skip rows. Instead we need to do that optimisation ourselves, by rearranging the maths so that no function is applied to the field we are filtering on.

    orders.timestamp >= '2012-01-01'
AND orders.timestamp <  '2012-02-01'

This version allows the optimiser to know that you want a block of dates that are all sequential with each other. It's called a range-seek. It can use an index to very quickly find the first record and last record that fit that range, then pick out every record in between. That avoids checking all the records that don't fit, and even avoids checking all the records in the middle of the range; only the boundaries need to be sought out.
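The half-open range also behaves correctly at the end of the month, which a BETWEEN with a hand-written upper bound may not. A minimal sketch using literal values only (no tables involved); the sample value 23:59:30 is an assumption chosen for illustration, and the casts are there so the comparison is done on DATETIME values rather than strings:

```sql
-- A moment in the last minute of January.
SET @sample = CAST('2012-01-31 23:59:30' AS DATETIME);

SELECT
  -- BETWEEN with an upper bound of 23:59:00 misses the last minute of the month.
  @sample BETWEEN CAST('2012-01-01 00:00:00' AS DATETIME)
              AND CAST('2012-01-31 23:59:00' AS DATETIME)   AS between_result,
  -- The half-open range includes everything before midnight on 1 February.
  (    @sample >= CAST('2012-01-01 00:00:00' AS DATETIME)
   AND @sample <  CAST('2012-02-01 00:00:00' AS DATETIME))  AS half_open_result;
```

Here between_result should come back as 0 and half_open_result as 1: the same row would be dropped by one predicate and kept by the other.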

That assumes all the records are ordered by date, and that the optimiser can see that. To do so you need an index. With that in mind there seem to be two basic composite indexes that you could use:
- (id, timestamp)
- (timestamp, id)

The first is what I see people use the most. But that forces the optimiser to do the timestamp range-seek for each id separately. And since every id likely has a different timestamp value, you've gained nothing.

The second index is what I recommend.
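As a rough sketch, the recommended index could be created like this. The index name idx_orders_timestamp_id is hypothetical; the table and column names are the ones from the question:

```sql
-- Index name is an assumption; use whatever fits your naming convention.
-- Backticks around timestamp are optional in MySQL but make the intent explicit.
CREATE INDEX idx_orders_timestamp_id
    ON orders (`timestamp`, id);

-- List the indexes on the table to confirm it was created.
SHOW INDEX FROM orders;
```

If orders happens to be an InnoDB table with id as its primary key (the question doesn't say), a secondary index on timestamp alone already carries the primary key, so a single-column index would probably serve the same purpose.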

Now the optimiser can fulfil this part of your query exceptionally quickly:

SELECT
  orders.*
FROM
  orders
WHERE
      orders.timestamp >= '2012-01-01'
  AND orders.timestamp <  '2012-02-01'
ORDER BY
  orders.timestamp ASC

As it happens, even the ORDER BY has been optimised with the suggested index. It's already in the order that you want the data to be output. There is no need to re-sort everything after the join.

Then, to fulfil the total != '0.00' requirement, every row in your range is still checked. But you've already narrowed the range down so much that this will probably be fine. (I won't go into it, but you will likely find it impossible to use indexes in MySQL to optimise both this and the timestamp range-seek.)

Then, you have your join. That's optimised by an index you already have (products.order_id). For every record picked out by the snippet above, the optimiser can do an index seek and very quickly identify the matching record(s).

This all assumes that, in the vast majority of cases, every order row has one or more product rows. If, for example, only a very select few orders had any product rows, it may be faster to pick out the product rows of interest first; essentially looking at the joins happening in reverse order.

The optimiser actually makes that decision for you, but it's handy to know that it's doing that, then provide the indexes you estimate will be most useful to it.

You can check the explain plan to see if the indexes are being used. If not, your attempt to help was ignored, probably because the statistics of the data implied that a different join order was better. If so, you can then provide indexes to help that join order instead.
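One way to do that check, sketched against the tables from the question, is to prefix the query with EXPLAIN. The second statement uses MySQL's STRAIGHT_JOIN modifier, which makes the optimiser join tables in the order they are listed, so you can compare the plan for the reverse join order; it is meant as a diagnostic here, not as a recommendation to keep it in production code:

```sql
-- Plan for the query as written; the optimiser picks the join order.
EXPLAIN
SELECT
  orders.*, products.name, products.amount, products.quantity
FROM
  orders
INNER JOIN
  products
    ON orders.id = products.order_id
WHERE
      orders.timestamp >= '2012-01-01'
  AND orders.timestamp <  '2012-02-01'
  AND orders.total     != '0.00'
ORDER BY
  orders.timestamp ASC;

-- Same query, but forcing products to be read first, for comparison.
EXPLAIN
SELECT STRAIGHT_JOIN
  orders.*, products.name, products.amount, products.quantity
FROM
  products
INNER JOIN
  orders
    ON orders.id = products.order_id
WHERE
      orders.timestamp >= '2012-01-01'
  AND orders.timestamp <  '2012-02-01'
  AND orders.total     != '0.00'
ORDER BY
  orders.timestamp ASC;
```

In the output, the key column shows which index (if any) was chosen for each table, a type of range on the orders row suggests the timestamp range-seek is happening, and "Using filesort" in the Extra column suggests the ORDER BY is not being served from the index.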

edited Oct 16, 2012 at 11:27

answered Oct 16, 2012 at 11:22


8 Comments

dai.hop Over a year ago

This is a fantastic answer - thanks for explaining so thoroughly! The only part I'm a little confused on is adding an index to the timestamp field. I always thought this field was not unique - and therefore incapable of being indexed. Is my understanding incorrect?

MatBailie Over a year ago

@dai.hop - Yes, that's incorrect: a field does not need to be unique to be indexed. It is often the default that the Primary Key is indexed (either as a Clustered Index or a Non-Clustered Index), but any other field can be indexed too.

dai.hop Over a year ago

Excellent. I've applied the index as suggested, the query now runs in 1.9 seconds - a significant improvement!

MatBailie Over a year ago

@dai.hop - Not as much as I'd have expected though. Did you make the other changes too? (To the query itself?)

dai.hop Over a year ago

Yes, I ran the final answer after indexing the timestamp field. My index information now looks like this [screenshot](dropbox.com/s/yhtos9fyx2phjey/Screen Shot 2012-10-17 at 14.06.48.png)


podiluska · Answer · 2012-10-16 10:51:21Z

2

Use EXPLAIN to see how the query is executed and where to optimise it. I'd suggest starting with indices on total and timestamp.
You may find removing the DATE() function improves performance.
You should use modern join syntax.

eg.

SELECT o.*, p.name, p.amount, p.quantity  
FROM orders o
     inner join products p  
     on o.id = p.order_id 
WHERE o.total != '0.00' 
AND o.timestamp BETWEEN '2012-01-01' AND '2012-01-31 23:59'  
ORDER BY o.timestamp ASC

answered Oct 16, 2012 at 10:51


2 Comments

MatBailie Over a year ago

I think you should explain why removing the DATE() call may help. And I disagree with BETWEEN '2012-01-01' AND '2012-01-31 23:59'. What about '2012-01-31 23:59:59'? Or '2012-01-31 23:59:59.99'? Don't use BETWEEN on datetime ranges, instead use o.timestamp >= '2012-01-01' AND o.timestamp < '2012-02-01'.

dan Over a year ago

+1 I agree with @Dems that BETWEEN should be used with care, but can be used even to compare dates. But +1 is for explain, that is crucial in solving performance issues.

roman · Answer · 2012-10-16 11:02:04Z

2

I'm not a MySQL expert (more SQL Server), but I think you'd be better off with an index on o.timestamp, and you need to rewrite your query like this:

o.timestamp >= '2012-01-01' and o.timestamp < '2012-01-31' + INTERVAL 1 DAY

The logic is: the index will not be used if you compare an expression on the column with constants. You need to compare the bare column with constants.

edited Oct 16, 2012 at 11:02

answered Oct 16, 2012 at 10:43


4 Comments

MatBailie Over a year ago

Except that this specific answer changes the functionality. What about a timestamp of '2012-01-31 23:59:59.99'? Don't use BETWEEN at all, instead use... o.timestamp >= '2012-01-01' AND o.timestamp < '2012-02-01'.

roman Over a year ago

Yes, you're right: you need two constants, cut off the time, and then add one day to the ending date. The answer was more about the idea. I'll change the answer.

MatBailie Over a year ago

There is no DATEADD() in MySQL. It's + INTERVAL 1 DAY.

roman Over a year ago

:) Oops, sorry again. The mistake of putting a function over a column is very frequent, and I just wanted to show that this should be changed.

Shabarinath Volam · Answer · 2012-10-16 10:49:44Z

1

SELECT *:

Selecting all columns with the * wildcard will cause the query's meaning and behavior to change if the table's schema changes, and might cause the query to retrieve too much data.

The != operator is non-standard:

Use the <> operator to test for inequality instead.

Aliasing without the AS keyword:

Explicitly using the AS keyword in column or table aliases, such as "tbl AS alias", is more readable than implicit aliases such as "tbl alias".
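Applied together, these three style points might look like the sketch below. Only the orders columns actually named in the question (id, total, timestamp) are listed, since the rest of the schema isn't shown; the date filter uses the half-open range suggested elsewhere in this thread:

```sql
SELECT
  o.id, o.total, o.timestamp,        -- explicit columns instead of o.*
  p.name, p.amount, p.quantity
FROM
  orders AS o                        -- explicit AS for table aliases
INNER JOIN
  products AS p
    ON o.id = p.order_id
WHERE
      o.timestamp >= '2012-01-01'
  AND o.timestamp <  '2012-02-01'
  AND o.total <> '0.00'              -- standard inequality operator
ORDER BY
  o.timestamp ASC;
```

None of these changes is expected to affect the execution plan; they are about readability and robustness to schema changes.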

answered Oct 16, 2012 at 10:49


2 Comments

MatBailie Over a year ago

Good pointers for best practices, but completely unrelated to performance in this case.

dai.hop Over a year ago

Thanks for the tip. I will keep this in mind when writing queries from now on.
