How can I make queries on a dynamic part of my data fast?

Question

The Problem

I have a query that will always only hit the data from the past two weeks. It is business critical, so it must be fast. The table is inserted into very many times per hour. The table is huge, so I am very reluctant to write a non-filtered index for the sake of such a relatively small amount of data. What I really want to write is

CREATE NONCLUSTERED INDEX [FIX_MyTable] ON [MyTable]
(
   [My],
   [Stupid],
   [Columns]
)
WHERE [StupidDate] > DATEADD(Day, -14, GETDATE())

but non-deterministic filters are illegal.

What I've Tried

As above, filtered indexes are not applicable. Playing around with a columnstore index's COMPRESSION_DELAY might be helpful, but is not a solution in and of itself.
Partitioning doesn't solve it. As with filtered indexes, you cannot define partitions with dynamic boundaries.
Indexed views can't solve it because they must be deterministic.
Computed columns don't work. They don't inter-operate with anything useful, so they're no better than just indexing [StupidDate].
Caching in the application is a partial solution at best and tempts towards the expensive non-filtered index solution.
I could create a script to drop and redefine my filtered index each day, but I expected to find an easier path. For the same reason, I have ruled out any trickery with copying the table or writing triggers.

Charlieface · Accepted Answer · 2025-02-20 15:08:48Z

"The table is huge, so I am very reluctant to write a non-filtered index for the sake of such a relatively small amount of data."

I agree with Steve that, unless storage poses a problem, a regular nonclustered index should be the go-to.

Compromise

But if you're still reluctant and want to save on storage, create a filtered nonclustered index with a hardcoded, static date. Since two weeks of data is, by your account, a small fraction of the table, a few months or even a year of data is probably still relatively small, even as the index accumulates more rows over time.

Let the index start out covering only two weeks of data. When it has grown significantly, say in a year or even a few years, re-evaluate whether it's worth keeping as is or replacing with a new filtered index whose hardcoded date is two weeks before that point in time. Creating a new index and dropping the old one should be relatively simple at that point.

This gives you most of the benefits of each approach:

An index that speeds up queries on the last two weeks of data
Measurably less storage consumed for maintaining that index
No need to drop and re-create the index every day just to roll one day's worth of data (a tiny amount) off it
No need to write automation scripts to do the above
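
A minimal sketch of that compromise, reusing the names from the question. The cutoff '20250206' is a hypothetical literal (roughly two weeks before this answer was written), and the INCLUDE of [StupidDate] is my assumption: it lets queries that filter on a narrower, more recent range still be answered from the index.

```sql
-- Filtered index with a hardcoded (static) date instead of GETDATE().
-- '20250206' is a hypothetical cutoff; use a literal about two weeks
-- before the day the index is created.
CREATE NONCLUSTERED INDEX [FIX_MyTable] ON [MyTable]
(
   [My],
   [Stupid],
   [Columns]
)
INCLUDE ([StupidDate])  -- assumption: queries filter on a narrower date range
WHERE [StupidDate] > '20250206';
```

When the index has grown enough to matter, the same statement with a later literal and a new index name, followed by DROP INDEX on the old one, resets it.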

One thing to keep in mind: for the filtered index to be applicable to a query, the corresponding date filter in that query needs to be a constant (as opposed to a parameter or variable). Because a filtered index's own filter is a constant, the optimizer can only match it against predicates that contain an applicable constant expression.

If the query does use a variable or parameter to filter on the date, then an applicable constant date filter should be added to the query (even though it may be logically redundant) so that the optimizer can match the filtered index. Note this constant date can be any date within the range of the filtered index's expression, and doesn't necessarily need to match the variable or parameter value.

This might be difficult if the filtered index's date keeps changing. Using OPTION (RECOMPILE) is another option, at the cost of recompiling every time.
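
Both variants can be sketched as follows, assuming the filtered index was created with the hypothetical filter [StupidDate] > '20250206'. The variable name @Since is made up for illustration.

```sql
-- Hypothetical variable holding the real, moving cutoff.
DECLARE @Since datetime = DATEADD(DAY, -14, GETDATE());

-- Variant 1: add a redundant literal predicate that falls inside the
-- index's filter, so the filtered index can be matched at compile time.
SELECT [My], [Stupid], [Columns]
FROM [MyTable]
WHERE [StupidDate] > @Since          -- the actual two-week window
  AND [StupidDate] > '20250206';     -- redundant constant matching the index filter

-- Variant 2: no literal; the plan is recompiled on each execution using
-- the current value of @Since.
SELECT [My], [Stupid], [Columns]
FROM [MyTable]
WHERE [StupidDate] > @Since
OPTION (RECOMPILE);
```

Variant 1 has to be edited whenever the index's literal changes; variant 2 avoids that but pays the compilation cost on every run.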

Paul White · Accepted Answer · 2025-02-20 16:34:44Z

Your other option is to have two tables, e.g. MyTable_Hot and MyTable_Cold.

Replace the original table with a view concatenating the two tables.

This gives you complete freedom over indexing. You can choose to move rows from hot to cold whenever that is convenient, and in whatever batch size suits.

Results from querying the view will always be correct, assuming you write the (one-way) data movement process correctly. DELETE with OUTPUT straight into the cold table is a popular choice.

Critical queries that need to access hot data only can use that table directly (or another view).
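
A rough sketch of this layout, assuming both tables have the same columns and the view takes over the original table's name. The batch size and the 14-day threshold are illustrative assumptions, not part of the answer.

```sql
-- View that replaces the original table name (assumes the original
-- table has already been split into MyTable_Hot and MyTable_Cold).
CREATE VIEW [MyTable]
AS
SELECT [My], [Stupid], [Columns], [StupidDate] FROM [MyTable_Hot]
UNION ALL
SELECT [My], [Stupid], [Columns], [StupidDate] FROM [MyTable_Cold];
GO

-- One-way data movement: delete a batch of old rows from the hot table
-- and write them into the cold table in the same statement.
DELETE TOP (5000) FROM [MyTable_Hot]
OUTPUT deleted.[My], deleted.[Stupid], deleted.[Columns], deleted.[StupidDate]
INTO [MyTable_Cold] ([My], [Stupid], [Columns], [StupidDate])
WHERE [StupidDate] <= DATEADD(DAY, -14, GETDATE());
```

Because the DELETE and the insert into the cold table happen in one statement, a row is never in both tables or in neither. Note that OUTPUT ... INTO places restrictions on the target table (for example, it cannot have enabled triggers or take part in a foreign key), so check those against your schema. GO is a batch separator for SSMS/sqlcmd, needed because CREATE VIEW must be the first statement in its batch.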

Whether this is a practical proposition for you depends on details that are not present in the question, but it is an option in general.

Steve · Accepted Answer · 2025-02-19 00:38:44Z

"I could create a script to drop and redefine my filtered index each day, but I expected to find an easier path. For the same reason, I have ruled out any trickery with copying the table or writing triggers."

The problem seems to be, if you are going to store indexing data based on a date column and have that index filtered according to a threshold that is entirely relative to today's date, then there must be a scheduled maintenance procedure that updates the index and purges the oldest part of the index as the clock itself progresses (or when the clock is synchronised/manually adjusted).

That is, as the calendar day turns or whenever the hardware clock is reset, for that reason alone your index must be immediately maintained, even if nothing else has happened in the database.

This is also the root of the "non-determinism" (i.e. volatility) in this case, because the present state of the clock is an implicit and varying input (in a data flow sense) into the GETDATE() function.

"... the expensive non-filtered index solution."

Exactly how expensive is the solution, relatively speaking? Is it a completely unacceptable expense even for a "business critical" query that "must be fast", or are you only fussing?

A non-filtered index is clearly the easy solution you seek, from a programming perspective.

Yes, it does potentially accumulate entries which as they age become redundant for the intended purpose of this index, but that is the price of having a fully-static index which can be maintained upon the insert of each row and does not require further scheduled maintenance.

The threshold of values which are looked up using this index - a threshold that varies daily - is then computed at query-time upon each and every execution and according to the current state of the calendar/clock, rather than being baked into the index itself.
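
As an illustration only: the index name and the choice of [StupidDate] as the key column are my assumptions, since the question does not show the actual query.

```sql
-- Ordinary (non-filtered) index: maintained on every insert,
-- no scheduled maintenance required.
CREATE NONCLUSTERED INDEX [IX_MyTable_StupidDate] ON [MyTable]
(
   [StupidDate]
)
INCLUDE ([My], [Stupid], [Columns]);

-- The moving threshold is computed at query time, on each execution.
SELECT [My], [Stupid], [Columns]
FROM [MyTable]
WHERE [StupidDate] > DATEADD(DAY, -14, GETDATE());
```

With [StupidDate] as the leading key, the query seeks directly to the last two weeks; older entries stay in the index but are never read by it.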
