optimize secondary index and date range query

Question

I am running an aggregate query that is taking much longer than expected. The query is from a single table without joins. The where clause includes a date range, an in clause, and a date column. There are only about 5k rows in the table, and the query time is 13s.

The query is:

select `site_id`, created_year_month_idx as time_column, count(*) as total 
from `patients` 
where `created_year_month_idx` between 20080101 and 20090101 and 
   `site_id` in (1,2,3) and 
   `patients`.`deleted_at` is null 
group by `created_year_month_idx`, `site_id`

When I explain the query, it seems to be doing a whole table scan:

| id  | select_type | table    | partitions | type  | possible_keys                                 | key                                   | key_len | ref | rows | filtered | Extra                                        |
| --- | ----------- | -------- | ---------- | ----- | --------------------------------------------- | ------------------------------------- | ------- | --- | ---- | -------- | -------------------------------------------- |
| 1   | SIMPLE      | patients |            | range | site_id,patients_created_year_month_idx_index | patients_created_year_month_idx_index | 4       |     | 1    | 100      | Using where; Using temporary; Using filesort |

The table create statements are:

CREATE TABLE `sites` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `name` varchar(10),
 PRIMARY KEY (`id`)
);

CREATE TABLE `patients` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `site_id` int(10) unsigned NOT NULL,
 `created_at` timestamp NULL DEFAULT NULL,
 `deleted_at` timestamp NULL DEFAULT NULL,
 `created_year_month_idx` date GENERATED ALWAYS AS (date_format(`created_at`,'%Y-%m-01')) VIRTUAL,
 PRIMARY KEY (`id`),
 KEY `site_id` (`site_id`),
 KEY `patients_created_year_month_idx_index` (`created_year_month_idx`),
 CONSTRAINT `patients_site` FOREIGN KEY (`site_id`) REFERENCES `sites` (`id`)
);

I created a DB Fiddle at https://www.db-fiddle.com/f/4zbjFpMYXEGSviprQcaTm3/0

(incidentally, if you can tell me how to format a markdown table on SO, I'll fix the above)

To my naive eye, an index on (site_id, created_year_month_idx), optionally including deleted_at, seems sensible. Incidentally, it's often as quick to try these things for yourself as ask us! But +1 for providing required info
– Strawberry, Commented Feb 11, 2020 at 22:12
@Strawberry - deleted_at will be null for most records and will be a date for those records marked for deletion. About 10% of records will have a value for deleted_at and 90% will be null.
– mankowitz, Commented Feb 12, 2020 at 3:27
@Strawberry - I have an index on site_id and also on created_year_month_idx. Initially I was concerned that created_year_month_idx would slow things down as it is a generated (not stored) column, but I read that creating the index would store the calculated values and therefore not require a table scan. Are you saying I should combine the indices into one index?
– mankowitz, Commented Feb 12, 2020 at 3:31
I thought so - so maybe amend your table definition accordingly.
– Strawberry, Commented Feb 12, 2020 at 15:06

Rick James · Accepted Answer · 2020-02-12 03:46:44Z

1

I vote for

INDEX(`deleted_at`, `created_year_month_idx`, `site_id`)

But mostly because it is "covering". deleted_at is first since it is essentially an equality test (IS NULL).
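As a rough sketch, assuming MySQL with InnoDB and the `patients` table from the question, the index could be added and checked like this. The index name `patients_deleted_created_site_index` is a made-up placeholder, not something from the original post.

```sql
-- Hypothetical index name; the column order follows the suggestion above:
-- the equality-style test (deleted_at IS NULL) first, then the range column,
-- then site_id so that the index contains every column the query reads.
ALTER TABLE `patients`
  ADD INDEX `patients_deleted_created_site_index`
      (`deleted_at`, `created_year_month_idx`, `site_id`);

-- Re-run EXPLAIN on the original query to see which index the optimizer picks.
EXPLAIN
SELECT `site_id`, `created_year_month_idx` AS time_column, COUNT(*) AS total
FROM `patients`
WHERE `created_year_month_idx` BETWEEN 20080101 AND 20090101
  AND `site_id` IN (1, 2, 3)
  AND `deleted_at` IS NULL
GROUP BY `created_year_month_idx`, `site_id`;
```

If the new index is chosen and really is covering, I would expect `Using index` to show up in the Extra column, but check the EXPLAIN output on your own data rather than taking that for granted.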

Do you realize you have one year plus one day? BETWEEN 20080101 AND 20090101
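If the intent is exactly the calendar year 2008, one common way to write it is a half-open range. This is only a sketch under that assumption; the original post does not say which boundary was intended.

```sql
-- Half-open range: includes 2008-01-01, excludes 2009-01-01.
-- Assumes the goal is to count only rows from calendar year 2008.
SELECT `site_id`, `created_year_month_idx` AS time_column, COUNT(*) AS total
FROM `patients`
WHERE `created_year_month_idx` >= '2008-01-01'
  AND `created_year_month_idx` <  '2009-01-01'
  AND `site_id` IN (1, 2, 3)
  AND `deleted_at` IS NULL
GROUP BY `created_year_month_idx`, `site_id`;
```

Since `created_year_month_idx` always holds the first day of a month, the original BETWEEN would also pick up the January 2009 bucket, which the version above leaves out.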

Do you really want about 1K rows of output?

answered Feb 12, 2020 at 3:46

Rick James


derek.wolfe · Accepted Answer · 2020-02-11 22:13:33Z

0

I can't tell if it is faster since there are no records in the tables and everything finishes in 1ms or less, but try joining to the sites table and then using the primary key for your IN instead of the foreign key, like this:

 SELECT s.id, p.created_year_month_idx AS time_column, COUNT(*) AS total 
 FROM patients p
 JOIN sites s ON s.id = p.site_id
 WHERE p.created_year_month_idx BETWEEN 20080101 AND 20090101 
 AND s.id IN (1,2,3) 
 AND p.deleted_at IS NULL 
 GROUP BY p.created_year_month_idx, s.id

EDIT: The reason the query is slow is because the query planner is not using any of your indexes. The above will use the primary key index.

answered Feb 11, 2020 at 22:13

derek.wolfe


4 Comments

Strawberry Over a year ago

Eh? You've lost me there.

derek.wolfe Over a year ago

Which part lost you?

Strawberry Over a year ago

The bit about joining an unnecessary table making it faster.

derek.wolfe Over a year ago

The EXPLAIN on the original query shows that no indexes were being used. The reason for this is that there was no index on all of the columns that are in use on the table. Since only the id from the sites table is in use, joining to that table allows its primary key index to be used. If the OP does not want to alter the table, this may be an option for them. If they are ok with altering the table, adding the index that @Rick James suggested is probably their best bet.

Gordon Linoff · Accepted Answer · 2020-02-11 22:25:17Z

Try this version of the query with the associated indexes:

select site_id, created_year_month_idx as time_column, count(*) as total 
from patients p
where created_year_month_idx between 20080101 and 20090101 and 
      site_id = 1 and 
      p.deleted_at is null 
group by site_id, created_year_month_idx
union all
select site_id, created_year_month_idx as time_column, count(*) as total 
from patients p
where created_year_month_idx between 20080101 and 20090101 and 
      site_id = 2 and 
      p.deleted_at is null 
group by site_id, created_year_month_idx
union all
select site_id, created_year_month_idx as time_column, count(*) as total 
from patients p
where created_year_month_idx between 20080101 and 20090101 and 
      site_id = 3 and 
      p.deleted_at is null 
group by site_id, created_year_month_idx;

Then the index is on patients(site_id, created_year_month_idx, deleted_at).
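A minimal sketch of creating that index, assuming MySQL/InnoDB and assuming the third column is meant to be `deleted_at` (the table in the question has no other deletion column). The index name is a hypothetical placeholder.

```sql
-- Hypothetical index name. site_id comes first because each branch of the
-- UNION ALL tests it with equality; the range column follows it.
ALTER TABLE `patients`
  ADD INDEX `patients_site_created_deleted_index`
      (`site_id`, `created_year_month_idx`, `deleted_at`);
```

With `site_id` fixed to a single value per branch, each subquery can in principle do a narrow range read on `created_year_month_idx` within that site; run EXPLAIN on the full statement to confirm the optimizer actually does so.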
