
I'm currently working on a web service that supports multiple databases. I'm trying to optimize tables and fix missing indexes. The following is the MySQL query:

SELECT 'UTC' AS timezone, pak.id AS package_id, rel.unique_id AS relay, sns.unique_id AS sensor, pak.rtime AS time,
   sns.units AS sensor_units, typ.name AS sensor_type, dat.data AS sensor_data,
   loc.altitude AS altitude, Y(loc.location) AS latitude, X(loc.location) as longitude,
   loc.speed as speed, loc.climb as climb, loc.track as track,
   loc.longitude_error as longitude_error, loc.latitude_error as latitude_error, loc.altitude_error as altitude_error,
   loc.speed_error as speed_error, loc.climb_error as climb_error, loc.track_error as track_error
 FROM sensor_data dat
 LEFT OUTER JOIN package_location loc on dat.package_id = loc.package_id
 LEFT OUTER JOIN data_package pak ON dat.package_id = pak.id
 LEFT OUTER JOIN relays rel ON pak.relay_id = rel.id
 LEFT OUTER JOIN sensors sns ON dat.sensor_id = sns.id
 LEFT OUTER JOIN sensor_types typ ON sns.sensor_type = typ.id
 WHERE typ.name='Temperature'
   AND rel.unique_id='OneWireTester'
   AND pak.rtime > '2015-01-01'
   AND pak.rtime < '2016-01-01'

and the EXPLAIN output...

+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+
| id | select_type | table | type   | possible_keys                            | key                  | key_len | ref                    | rows | Extra                                              |
+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+
|  1 | SIMPLE      | rel   | ALL    | PRIMARY                                  | NULL                 | NULL    | NULL                   |    5 | Using where                                        |
|  1 | SIMPLE      | pak   | ref    | PRIMARY,fk_package_relay_id              | fk_package_relay_id  | 9       | BigSense.rel.id        |    1 | Using index condition; Using where                 |
|  1 | SIMPLE      | dat   | ref    | fk_sensor_package_id,fk_sensor_sensor_id | fk_sensor_package_id | 9       | BigSense.pak.id        |    1 | NULL                                               |
|  1 | SIMPLE      | sns   | eq_ref | PRIMARY,fk_sensors_type_id               | PRIMARY              | 8       | BigSense.dat.sensor_id |    1 | NULL                                               |
|  1 | SIMPLE      | loc   | eq_ref | PRIMARY                                  | PRIMARY              | 8       | BigSense.pak.id        |    1 | NULL                                               |
|  1 | SIMPLE      | typ   | ALL    | PRIMARY                                  | NULL                 | NULL    | NULL                   |    5 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+--------+------------------------------------------+----------------------+---------+------------------------+------+----------------------------------------------------+

...seems pretty straightforward. I need to add indexes on the relays and sensor_types tables to optimize the query.
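
Something like the following, I assume (the index names are just illustrative):

    CREATE INDEX idx_relays_unique_id ON relays (unique_id);
    CREATE INDEX idx_sensor_types_name ON sensor_types (name);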

The tables for the PostgreSQL version are pretty much identical. However, when I use the following query:

SELECT 'UTC' AS timezone, pak.id AS package_id, rel.unique_id AS relay, sns.unique_id AS sensor, pak.rtime AS time,
       sns.units AS sensor_units, typ.name AS sensor_type, dat.data AS sensor_data,
       loc.altitude AS altitude, ST_Y(loc.location::geometry) AS latitude, ST_X(loc.location::geometry) as longitude,
       loc.speed as speed, loc.climb as climb, loc.track as track,
       loc.longitude_error as longitude_error, loc.latitude_error as latitude_error, loc.altitude_error as altitude_error,
       loc.speed_error as speed_error, loc.climb_error as climb_error, loc.track_error as track_error
FROM sensor_data dat
LEFT OUTER JOIN package_location loc on dat.package_id = loc.package_id
LEFT OUTER JOIN data_package pak ON dat.package_id = pak.id
LEFT OUTER JOIN relays rel ON pak.relay_id = rel.id
LEFT OUTER JOIN sensors sns ON dat.sensor_id = sns.id
LEFT OUTER JOIN sensor_types typ ON sns.sensor_type = typ.id
WHERE typ.name='Temperature'
  AND rel.unique_id='OneWireTester'
  AND pak.rtime > '2015-01-01'
  AND pak.rtime < '2016-01-01';

When I run EXPLAIN ANALYZE, I get the following:

    QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=36.23..131.80 rows=1 width=477) (actual time=0.074..3.933 rows=76 loops=1)
   ->  Nested Loop  (cost=36.09..131.60 rows=1 width=349) (actual time=0.068..3.782 rows=76 loops=1)
         ->  Nested Loop  (cost=35.94..130.58 rows=4 width=267) (actual time=0.062..2.472 rows=620 loops=1)
               ->  Hash Join  (cost=35.67..128.73 rows=4 width=247) (actual time=0.053..0.611 rows=620 loops=1)
                     Hash Cond: (dat.sensor_id = sns.id)
                     ->  Seq Scan on sensor_data dat  (cost=0.00..89.46 rows=946 width=21) (actual time=0.007..0.178 rows=1006 loops=1)
                     ->  Hash  (cost=35.64..35.64 rows=2 width=238) (actual time=0.037..0.037 rows=11 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Hash Join  (cost=20.68..35.64 rows=2 width=238) (actual time=0.019..0.035 rows=11 loops=1)
                                 Hash Cond: (sns.sensor_type = typ.id)
                                 ->  Seq Scan on sensors sns  (cost=0.00..13.60 rows=360 width=188) (actual time=0.002..0.005 rows=31 loops=1)
                                 ->  Hash  (cost=20.62..20.62 rows=4 width=66) (actual time=0.010..0.010 rows=1 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                       ->  Seq Scan on sensor_types typ  (cost=0.00..20.62 rows=4 width=66) (actual time=0.006..0.008 rows=1 loops=1)
                                             Filter: ((name)::text = 'Temperature'::text)
                                             Rows Removed by Filter: 4
               ->  Index Scan using data_package_pkey on data_package pak  (cost=0.28..0.45 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=620)
                     Index Cond: (id = dat.package_id)
                     Filter: ((rtime > '2015-01-01 00:00:00'::timestamp without time zone) AND (rtime < '2016-01-01 00:00:00'::timestamp without time zone))
         ->  Index Scan using relays_pkey on relays rel  (cost=0.14..0.24 rows=1 width=94) (actual time=0.002..0.002 rows=0 loops=620)
               Index Cond: (id = pak.relay_id)
               Filter: ((unique_id)::text = 'OneWireTester'::text)
               Rows Removed by Filter: 1
   ->  Index Scan using package_location_pkey on package_location loc  (cost=0.14..0.18 rows=1 width=140) (actual time=0.001..0.001 rows=0 loops=76)
         Index Cond: (dat.package_id = package_id)
 Planning time: 0.959 ms
 Execution time: 4.030 ms
(27 rows)

The table schema has the same foreign keys and general structure, so I'd expect the same indexes to be required. However, I've been looking through several guides on PostgreSQL's EXPLAIN statement, and from what I've gathered, the Seq Scan statements are indicators of missing indexes, meaning I am missing indexes on sensors, sensor_data, and sensor_types.
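
For reference, one way to check which indexes already exist on those tables is to query the pg_indexes catalog view:

    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE tablename IN ('sensors', 'sensor_data', 'sensor_types');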

Am I interpreting the results of these EXPLAIN statements correctly? What should I be looking for in order to optimize both databases?

  • "and from what I've gathered, the Seq Scan statements are indicators of missing indexes" is wrong. Commented Jul 29, 2015 at 23:08
  • ... and you should not try to optimize small queries. The resulting total time is 4 msec, smaller than one disk seek. Anything goes ... Commented Jul 29, 2015 at 23:26
  • Please read stackoverflow.com/q/15474812/398670. Short version: a seq scan is often faster on a small table, or one where you want to fetch most of the rows anyway. Commented Jul 30, 2015 at 7:50

2 Answers


In PostgreSQL (and probably MySQL as well), indexes are not used simply because they are defined; they are used when the planner estimates that doing so would speed up the query.
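
If you want to check whether the planner could use an index at all, one trick is to discourage sequential scans for the current session and compare the plans. This is purely a diagnostic (never leave it on), and the query below is just a small stand-in for the real one:

    SET enable_seqscan = off;   -- makes the planner avoid seq scans if it can
    EXPLAIN ANALYZE
    SELECT * FROM sensor_types WHERE name = 'Temperature';
    RESET enable_seqscan;       -- back to the default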

In the EXPLAIN ANALYZE output you see, between parentheses, a section on cost followed by a similar section on actual time. The query planner looks at the cost, which is derived from a number of parameters listed in the configuration file. These costs model things like I/O and CPU time, with I/O typically weighted much more heavily than CPU (by default roughly a factor of 100). This means the query planner tries to minimize the amount of data that has to be read from disk.

Disk reads happen in pages of a pre-determined size (8kB by default in PostgreSQL), not in individual rows, because page-sized access is much faster given the physical characteristics of hard disk drives. Both the table itself and the index are stored on disk. If the table is small, it will fit in a few pages, maybe even just a single page. Since CPU time is cheap compared to I/O time, it is much faster to sequentially scan those few pages than to pay the extra I/O of reading the pages holding the index as well.
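
You can inspect the relevant planner cost settings directly; the values in the comments are the stock defaults:

    SHOW seq_page_cost;         -- 1.0: fetching a page during a sequential scan
    SHOW random_page_cost;      -- 4.0: fetching a page out of order (index access)
    SHOW cpu_tuple_cost;        -- 0.01: processing one row
    SHOW cpu_index_tuple_cost;  -- 0.005: processing one index entry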

As you can tell from the EXPLAIN ANALYZE output, most of your tables are small and fit on a handful of pages. If you really want to test the effect of the indexes, you should load your tables with a million or so rows of random data and then do the testing.
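
A minimal sketch of such a load using generate_series; the column names and value ranges here are assumptions based on the query in the question, so adjust them to your actual schema (and any foreign keys):

    -- one million synthetic readings
    INSERT INTO sensor_data (package_id, sensor_id, data)
    SELECT (random() * 1000)::int + 1,   -- assumed range of existing package ids
           (random() * 30)::int + 1,     -- assumed range of existing sensor ids
           (random() * 100)::text        -- fake reading value
    FROM generate_series(1, 1000000);
    ANALYZE sensor_data;                 -- refresh planner statistics after the load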


There is a nice tool called Explain Depesz (https://explain.depesz.com) into which you can paste your EXPLAIN ANALYZE output, and it will visually present what is happening in your query. For example, here is your output run through it:

[Image: Explain Depesz output]

  • The first column (exclusive) is the time, in milliseconds, spent in that row exclusively, not counting its nested steps.
  • The second column (inclusive) is also a time, but it is the sum of the row and all of its nested steps.
  • The third column (rows x) shows by how much the planner under- or over-estimated the number of rows the given row would return.
  • The fourth column (rows) shows how many rows the node actually returned.
  • The last column (loops) shows how many times the given node ran.

So, in your example, as we can see, the main problem with your query is the index scans being done in lines 11 and 12 of that output. It seems to me you should create better indexes, because even with the current ones, those nodes run 620 times each.

Maybe you need an index on the rtime column of the data_package table. That's not the only thing, though: optimizing queries involves multiple steps, and I have had many cases where the answer was to restructure the query itself.
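
A sketch of that index (the name is just illustrative):

    CREATE INDEX idx_data_package_rtime ON data_package (rtime);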

For example, you could first select only the packages from the year you are interested in; then, working with that much smaller set, filter on the unique_id, as sketched below.
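
Something along these lines (a sketch reusing the tables from the question):

    -- narrow down by the time range first, then filter the smaller set by relay
    WITH year_packages AS (
        SELECT id, relay_id, rtime
        FROM data_package
        WHERE rtime > '2015-01-01' AND rtime < '2016-01-01'
    )
    SELECT p.id, p.rtime, r.unique_id
    FROM year_packages p
    JOIN relays r ON p.relay_id = r.id
    WHERE r.unique_id = 'OneWireTester';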

Admittedly, your question was asked 9 years ago. You have probably solved all of these problems, and you might be a master of PostgreSQL's EXPLAIN by now. Still, I hope this answer can shed some light for others.
