I have a MySQL table where an indexed INT column is going to be 0 for 90% of the rows. If I change those rows to use NULL instead of 0, will they be left out of the index, making the index about 90% smaller?
6 Answers
http://dev.mysql.com/doc/refman/5.0/en/is-null-optimization.html
MySQL can perform the same optimization on col_name IS NULL that it can use for col_name = constant_value. For example, MySQL can use indexes and ranges to search for NULL with IS NULL.
It looks like it does index the NULLs too.
Be careful when you run this because MySQL will LOCK the table for WRITES during the index creation. Building the index can take a while on large tables even if the column is empty (all nulls).
Allowing a column to be NULL adds a byte to the storage requirements of the column. This leads to a larger index, which is probably not good. That said, if a lot of your queries are changed to use "IS NULL" or "IS NOT NULL", they might be faster overall than doing value comparisons.
My gut would tell me not null, but there's one answer: test!
No, it will continue to include them, but don't make too many assumptions about what the consequences are in either case. A lot depends on the range of other values (google for "cardinality").
MSSQL has a new index type called a "filtered index" for this type of situation (i.e. includes records in the index based on a filter). dBASE-type systems used to have a similar capability, and it was pretty handy.
Each index has a cardinality, meaning the number of distinct values it indexes. AFAIK a repeated value is not dropped from the index: each index entry still has to point at its row in the clustered index, and that includes the rows holding NULL in this field. Keeping that clustered-index reference means each row with a NULL in the indexed field still costs roughly the size of the PK (which is why experts recommend keeping the PK reasonably small, especially if it is a composite PK).
This question leaves out a lot of detail like which storage engine you're using. Assuming you're using the more popular InnoDB - this doc goes into detail about the different row formats.
https://dev.mysql.com/doc/refman/8.4/en/innodb-row-format.html
Now you need to determine/decide which row format you're using/you want to use. Let's assume you're using the default, which is DYNAMIC (as decided by https://dev.mysql.com/doc/refman/8.4/en/innodb-parameters.html#sysvar_innodb_default_row_format).
Each index record contains a 5-byte header that may be preceded by a variable-length header. The header is used to link together consecutive records, and for row-level locking.
The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector. The variable-length part of the header also contains the lengths of variable-length columns. Each length takes one or two bytes, depending on the maximum length of the column. If all columns in the index are NOT NULL and have a fixed length, the record header has no variable-length part.
For each non-NULL variable-length field, the record header contains the length of the column in one or two bytes. Two bytes are only needed if part of the column is stored externally in overflow pages or the maximum length exceeds 255 bytes and the actual length exceeds 127 bytes. For an externally stored column, the 2-byte length indicates the length of the internally stored part plus the 20-byte pointer to the externally stored part. The internal part is 768 bytes, so the length is 768+20. The 20-byte pointer stores the true length of the column.
The record header is followed by the data contents of non-NULL columns.
So with this we can conclude that:
The header of each index record reserves a small bit vector in which InnoDB records which of the index's nullable columns are NULL in that record.
- You can calculate the space this takes with the formula given, CEILING(N/8) bytes per record, once you know how many indexes you have and how many nullable columns each index contains.
The part of the record header that contains the length of the column will only include data for records where the value is not NULL.
The storage that follows the record header is only used for the data contents of non-NULL columns.
- This means that the NULL value itself is not stored per-column, per-row.
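So the final conclusion here is that NULL values do save space: a NULL column costs only its bit in the record header's bit vector, and none of its data bytes are stored. The rows themselves are still present in the index, though, so the index shrinks by the size of the omitted values rather than by the number of NULL rows.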
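To put rough numbers on this, here is a back-of-the-envelope sketch in Python that applies the quoted header layout to the table from the question. It is an estimate under stated assumptions, not a measurement: the function names are made up for illustration, the 4-byte INT primary key carried in every secondary index record is assumed, and page overhead, fill factor, and non-leaf pages are ignored.

```python
RECORD_HEADER_BYTES = 5  # fixed header per index record, per the quoted docs


def null_bitmap_bytes(nullable_columns: int) -> int:
    """CEILING(N/8) bytes for N nullable columns in the index."""
    return (nullable_columns + 7) // 8


def index_record_bytes(value_is_null: bool, column_nullable: bool,
                       int_bytes: int = 4, pk_bytes: int = 4) -> int:
    """Rough size of one secondary index record on a single INT column.

    Assumption: a fixed-length NOT NULL primary key of pk_bytes is stored
    in every secondary index record.
    """
    size = RECORD_HEADER_BYTES + null_bitmap_bytes(1 if column_nullable else 0)
    if not value_is_null:
        size += int_bytes  # NULL values store no data bytes
    return size + pk_bytes


def estimate_index_bytes(total_rows: int, null_rows: int,
                         column_nullable: bool) -> int:
    """Sum the record sizes for the NULL and non-NULL rows."""
    non_null_rows = total_rows - null_rows
    return (null_rows * index_record_bytes(True, column_nullable)
            + non_null_rows * index_record_bytes(False, column_nullable))


if __name__ == "__main__":
    rows = 1_000_000
    with_zeros = estimate_index_bytes(rows, 0, column_nullable=False)
    with_nulls = estimate_index_bytes(rows, 900_000, column_nullable=True)
    print(f"NOT NULL column, 0 as placeholder: {with_zeros:,} bytes")
    print(f"nullable column, 90% NULL:         {with_nulls:,} bytes")
    print(f"estimated saving: {1 - with_nulls / with_zeros:.0%}")
```

Under these assumptions the estimate comes out to about a 20% reduction rather than 90%, because every row still needs its record header and primary key reference. As an earlier answer says, the real test is to build both variants and compare the actual index sizes.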