I have a columnar table that is partitioned by day and hour. It is stored on S3 in parquet files to be queried by Athena. Here is the CREATE TABLE:
CREATE EXTERNAL TABLE foo (
-- other columns here
dt timestamp,
day string,
hour string
)
PARTITIONED BY (day string, hour string)
STORED AS parquet
LOCATION 's3://foo/foo'
And the layout on S3 is like:
s3://foo/foo/day=2021-10-10/hh=00/*.parquet
s3://foo/foo/day=2021-10-10/hh=01/*.parquet
...etc
s3://foo/foo/day=2021-10-10/hh=23/*.parquet
So a query like the following will be fast because it only scans over one hour of parquet files because the partition columns are being used to filter it:
-- fast, easy to write
SELECT * FROM foo WHERE day = '2021-10-10' AND hour = '00'
However, the table also includes the full datetime dt. Usually we want to write queries for ranges that don't align to a day/hour boundary, and/or are in a different timezone.
For example, this will scan ALL parquet files and be very slow:
-- slow, easy to write
SELECT * FROM foo WHERE dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
It can be improved by manually calculating the day and hour that minimally enclose the time period:
-- fast, painful to write
SELECT * FROM foo
WHERE
((day, hh) IN (('2021-10-09', '23'), ('2021-10-11', '00')) OR day = '2021-10-10')
AND
dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
Ideally this extra condition could be added transparently by the database so as to avoid having to manually add the ((day,hh) IN (...)).
Is this possible somehow with Athena?