SQL Server Clustered Index on Non-Unique Column

Question

I've been trying to read up on the use of clustered indexes as a means of improving query performance.

Essentially, I have a 'Messages' table, which is used in a chat application. When the user opens a chat, we read the message history from the table based on the customer the user is speaking to.

The messages table is structured with the columns:

Id : CustomerId : Content

The query most often used on the table will be something like

SELECT * FROM Message WHERE CustomerId = @CustomerId

My question is, is the CustomerId column a suitable candidate for a clustered index? Also, given that the ID field is the primary key, will SQL Server still need to 'uniqueify' the clustered index?

The table will be heavy on both selects and inserts.

Gordon Linoff · Accepted Answer · 2016-09-14 12:30:53Z

The clustered index does not need to be unique, so it is possible.

However, the issue is that each time a new message is inserted, SQL Server needs to find a space for the new row next to the other rows for the same customer. This can often be inefficient, because pages need to be split, resulting in many half-filled pages. And, things get even more complicated if you have deletes on the rows as well.

There are several options. In a busy database, you can leave room on the pages for additional inserts. Or, another option is to partition the table based on the customer id. It all depends.

Under most circumstances, an identity column on the messages table would be the primary key and the clustered key as well. An additional index on the customer table would be sufficient. But, there are definitely alternative structures that can work better in some scenarios.

Remus Rusanu · Accepted Answer · 2016-09-14 12:40:56Z

0

Adding a clustered index on a key on which there is already a primary key constrain is an unnecessary duplication. Instead the primary key constrain should also be the clustered index. However, your question is actually different...

is the CustomerId column a suitable candidate for a clustered index?

One cannot answer this without knowing how you will be querying the table. There are numerous query patterns for which this organization will not be optimal (typical examples being time series where the time column is the appropriate clustered key). And there are just as many examples of queries for which this is the best clustered index. Unanswerable.

given that the ID field is the primary key, will SQL Server still need to 'uniqueify' the clustered index?

If the index is not declared unique then SQL Server will add the uniquifier column. The column value will never materialize though, as no duplicates will ever occur.

I have a 'Messages' table, which is used in a chat application ... WHERE CustomerId = @Id ... the ID field is the primary key

Pardon me, but this does not make any sense. What you're saying is that the Messages table can have only one message from each customer. That would make a horrible chat experience. I'm pretty sure your explanation is wrong.

I would expect a Customers table with clustered index and primary key constraint on CustomerId. The Messages table is likely to be organized by chat room, or some other group organization that pairs the chat participants. If the chats are always between one and exactly one Customer and a representative, then the 'chat room' may be the customer itself. In any way, the typical querying of such a Messages table would want all the Messages exchanged in a chat room, in the order posted or all the Messages exchanged with a customer, in the order posted. This is, actually, a partitioned time series and best served by a clustered index like (chat_id, post_time) or (customer_id, post_time). Note that this is not the primary key, the table may well have a message_id as primary key, but non-clustered.

answered Sep 14, 2016 at 12:40

Remus Rusanu

296k42 gold badges459 silver badges583 bronze badges

2 Comments

Remus Rusanu Over a year ago

@GordonLinoff makes a valid point that a clustered key like (chat_id, post_time) leads to fragmentation as data is inserted at multiple points in the index, but in my experience this, when it becomes a problem, is solvable with simple reorganize, or a rebuild (eventually with a low fill_factor). Choosing instead a clustered index that ensures append-at-end-only semantics for the physical organization and penalizing queries in the process is usually far worse.

Callum Evans Over a year ago

'@Id' was supposed to be the customer id as an argument passed into a stored procedure. So, you would lookup all messages for a customer by their Id. It was a poorly written example. Yes, in this case the 'customer id' is basically the chat room identifier.

Collectives™ on Stack Overflow

SQL Server Clustered Index on Non-Unique Column

2 Answers 2

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related