Optimising query with large WHERE IN and Date clause

Question

I have a query similar to:

SELECT
    ANY_VALUE(name) AS `name`,
    100 * SUM(score) / SUM(sum(score)) OVER (PARTITION BY date(scores.created_at)) AS `average_score`,
    ANY_VALUE(DATE_FORMAT(scores.created_at, "%Y-%m-%d")) AS `shift_date`
FROM
    `scores`
    INNER JOIN `shifts` ON `shifts`.`id` = `scores`.`shift_id`
WHERE
    `shifts`.`table_c_id` in(1, 2, 3, 4, 5, 6, 7, 8, 9, 10……)
    AND date(`scores`.`created_at`) >= '2020-01-01'
GROUP BY
    `name`,
    date(scores.created_at)
ORDER BY
    `shift_date` ASC;

The WHERE IN list can contain up to 2000 IDs, which may not be sequential, and the created_at filter can reach back up to 14 months. At those sizes, the query currently takes 10-20 seconds.

I'm trying to optimise this. I've tried adding an index on created_at on the scores table, but that had no effect. I also tried changing the date condition to:

AND `scores`.`created_at` >= '2020-01-01 00:00:00'

Which again made no difference.

From reading up on the topic, some people recommend creating a temporary table, but I can't see how this would have any benefit. I'm also not sure how to do this in one query (is it even possible?).
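
For reference, a minimal sketch of what the temporary-table suggestion might look like. It is not a single statement: a temporary table is scoped to the session, so the statements below would have to run on the same connection. The table name wanted_table_c_ids and the alias w are hypothetical, the ID list is shortened, the windowed percentage is left out for brevity, and the score column is assumed to be named score as in the query above. Whether this beats the long IN list would need to be measured.

```sql
-- Hypothetical temporary table holding the table_c_id values to filter on.
CREATE TEMPORARY TABLE wanted_table_c_ids (
    table_c_id INT UNSIGNED NOT NULL PRIMARY KEY
);

-- Bulk-insert the IDs (only a few shown here for illustration).
INSERT INTO wanted_table_c_ids (table_c_id)
VALUES (1), (2), (3), (150), (151);

-- The join on the temporary table replaces the long IN (...) list.
SELECT
    scores.name,
    DATE(scores.created_at) AS shift_date,
    SUM(scores.score) AS total_score
FROM scores
INNER JOIN shifts ON shifts.id = scores.shift_id
INNER JOIN wanted_table_c_ids AS w ON w.table_c_id = shifts.table_c_id
WHERE scores.created_at >= '2020-01-01'
GROUP BY scores.name, shift_date
ORDER BY shift_date ASC;

-- Clean up explicitly; the table would also disappear when the session ends.
DROP TEMPORARY TABLE wanted_table_c_ids;
```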

The indexes on the scores table are: shift_id, employee_id, and a composite index on name, created_at (used for another query). As I said, a created_at index didn't help this one.

The shifts table has indexes on table_c_id and created_at.

Some sites suggest using WITH and CTEs, but again, I'm not sure how this would work or if the performance would actually improve.

The schema for scores and shifts is:

DROP TABLE IF EXISTS `scores`;
CREATE TABLE `scores` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `shift_id` int unsigned NOT NULL,
  `hash` varchar(40) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `score` double(8,2) unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `scores_hash_index` (`hash`) USING BTREE,
  KEY `scores_shift_id_index` (`shift_id`) USING BTREE,
  KEY `scores_name_created_at_index` (`name`,`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=3140922 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

DROP TABLE IF EXISTS `shifts`;
CREATE TABLE `shifts` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `table_c_id` int unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `shifts_table_c_id_index` (`table_c_id`),
  KEY `shifts_created_at_index` (`created_at`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=536392 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Update

Using a lookup table for names:

The names table has two columns: id (int unsigned, primary key) and name (varchar). The scores table references it through name_id.

SELECT
    names.name AS `name`,
    100 * SUM(score) / SUM(sum(score)) OVER (PARTITION BY date(scores.created_at)) AS `average_score`,
    ANY_VALUE(DATE_FORMAT(scores.created_at, "%Y-%m-%d")) AS `shift_date`
FROM
    `scores`
    INNER JOIN `shifts` ON `shifts`.`id` = `scores`.`shift_id`
    INNER JOIN `names` ON `names`.id = `scores`.`name_id`
WHERE
    `shifts`.`table_c_id` in(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 
503, 504, 505, 506)
    AND `scores`.`created_at` >= '2019-04-03'
GROUP BY
    `names`.`name`,
    date(scores.created_at)
ORDER BY
    `shift_date` ASC;

This has given no benefit. An index on the scores table on shift_id, name_id and created_at hasn't helped either.

Would be better to write a subquery and use BETWEEN instead of IN.
– zealous, Commented Apr 16, 2020 at 2:35
The IN IDs may not always be sequential. Will update the post.
– Shiv, Commented Apr 16, 2020 at 2:42
Show DDLs for both tables. Replace ANY_VALUE(DATE_FORMAT(scores.created_at, "%Y-%m-%d")) with date(scores.created_at).
– Akina, Commented Apr 16, 2020 at 3:47

Rick James · Accepted Answer · 2020-04-16 04:01:04Z

Plan A: Avoid windowing functions (many of them are slower than one would think).

SELECT
    ANY_VALUE(brand_name),
    100 * SUM(score) / init.tot AS `average_score`,
    DATE(scores.created_at) AS `shift_date`
FROM
    `scores`
INNER JOIN `shifts` ON `shifts`.`id` = `scores`.`shift_id`
JOIN ( SELECT SUM(score) AS tot
        FROM scores
        JOIN shifts ON shifts.id = scores.shift_id
        WHERE shifts.table_c_id IN (...)
          AND scores.created_at >= '2020-01-01' ) AS init
WHERE
    `shifts`.`table_c_id` in(1, 2, 3, 4, 5, 6, 7, 8, 9, 10……)
    AND `scores`.`created_at` >= '2020-01-01'
GROUP BY
    shift_date,
    `brand_name`
ORDER BY
    `shift_date` ASC;

Notes:

I made several changes to the syntax as stated.
I assumed that name and brand_name were the same.
Flipping the GROUP BY order may avoid a second sort.
I used a derived table to compute the grand total, thereby obviating the need for OVER.
This composite, covering, index on scores may help:
```
INDEX(created_at, shift_id)
```
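
A minimal sketch of adding that index and checking whether the optimizer uses it. The index name scores_created_at_shift_id_index is hypothetical and the IN list is shortened. Note that the range can only use the index when created_at is compared directly; wrapping it as date(created_at) in the WHERE clause hides the column from the index.

```sql
-- Index name is hypothetical; the column order follows the suggestion above.
ALTER TABLE scores
    ADD INDEX scores_created_at_shift_id_index (created_at, shift_id);

-- Inspect the plan: the "key" column shows which index, if any, was chosen.
EXPLAIN
SELECT scores.shift_id, scores.created_at
FROM scores
INNER JOIN shifts ON shifts.id = scores.shift_id
WHERE shifts.table_c_id IN (1, 2, 3)   -- shortened list for illustration
  AND scores.created_at >= '2020-01-01';
```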

Plan B: Use a CTE to compute SUM(score), then finish the query.
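
A rough sketch of what Plan B could look like, assuming MySQL 8.0+ (needed for WITH) and assuming the denominator should be the per-day total, as in the original PARTITION BY date(scores.created_at). The CTE names filtered and daily and the columns name_total and day_total are made up for illustration, and the IN list is shortened. It is untested against the real data, so whether it is faster than the window function would need measuring.

```sql
WITH filtered AS (
    -- Apply the filters once and pre-aggregate per name and day.
    SELECT
        scores.name,
        DATE(scores.created_at) AS shift_date,
        SUM(scores.score) AS name_total
    FROM scores
    INNER JOIN shifts ON shifts.id = scores.shift_id
    WHERE shifts.table_c_id IN (1, 2, 3)   -- shortened list for illustration
      AND scores.created_at >= '2020-01-01'
    GROUP BY scores.name, DATE(scores.created_at)
),
daily AS (
    -- One total per day, standing in for SUM(SUM(score)) OVER (PARTITION BY date).
    SELECT shift_date, SUM(name_total) AS day_total
    FROM filtered
    GROUP BY shift_date
)
SELECT
    f.name,
    100 * f.name_total / d.day_total AS average_score,
    f.shift_date
FROM filtered AS f
INNER JOIN daily AS d ON d.shift_date = f.shift_date
ORDER BY f.shift_date ASC;
```

The filters and the join to shifts are evaluated in one place (filtered), and the second CTE only aggregates the already-reduced rows.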

4 Comments

Shiv Over a year ago

That is a lot quicker, but the result is incorrect. stackoverflow.com/questions/61169223/… is the question I made about the calculation - it seems to be a weird one!

Shiv Over a year ago

The reason yours didn't work is that init.tot is the total of all, whilst the window function is only aggregating by date

Rick James Over a year ago

@Shiv - Then the subquery needs a group by DATE. Or something like that. (I don't understand the intent of the original query, so I probably mangled it.)

Shiv Over a year ago

Adding the GROUP BY, plus a WHERE on the outer query to check that the date matches, slowed it down dramatically.
