I am trying to root-cause a customer case where two identical drives, formatted with the same command, ended up differing by ~55 GiB in total disk space due to additional inode overhead.
I want to understand:
- The math on how 2x `Inodes per group` translates to 2x `Inode count`
- How `Inodes per group` gets set when the `lazy_itable_init` flag is used
Environment:
The 2 drives are on 2 identical hardware servers, running on the same exact OS. Here are the details of the 2 drives (Sensitive info redacted):
Drive A:
=== START OF INFORMATION SECTION ===
Vendor: HPE
Product: <strip>
Revision: HPD4
Compliance: SPC-5
User Capacity: 7,681,501,126,656 bytes [7.68 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: <strip>
Serial number: <strip>
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Apr 25 07:39:27 2022 GMT
SMART support is: Available - device has SMART capability.
Drive B:
=== START OF INFORMATION SECTION ===
Vendor: HPE
Product: <strip>
Revision: HPD4
Compliance: SPC-5
User Capacity: 7,681,501,126,656 bytes [7.68 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: <strip>
Serial number: <strip>
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Apr 25 07:39:23 2022 GMT
SMART support is: Available - device has SMART capability.
The command run to format the drive is:
sudo mke2fs -F -m 1 -t ext4 -E lazy_itable_init,nodiscard /dev/sdc1
The issue:
The `df -h` output for Drives A and B respectively shows Drive A with size 6.9T vs. Drive B with size 7.0T:
/dev/sdc1 6.9T 89M 6.9T 1% /home/<strip>/data/<serial>
...
/dev/sdc1 7.0T 3.0G 6.9T 1% /home/<strip>/data/<serial>
Observations:
- fdisk output on both drives shows they have identical partitions.
DriveA:
Disk /dev/sdc: 7681.5 GB, 7681501126656 bytes, 15002931888 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disk label type: gpt
Disk identifier: 70627C8E-9F97-468E-8EE6-54E960492318
# Start End Size Type Name
1 2048 15002929151 7T Microsoft basic primary
DriveB:
Disk /dev/sdc: 7681.5 GB, 7681501126656 bytes, 15002931888 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disk label type: gpt
Disk identifier: 702A42FA-9A20-4CE4-B938-83D3AB3DCC49
# Start End Size Type Name
1 2048 15002929151 7T Microsoft basic primary
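Just to rule out a raw-capacity difference, the partition size can be recomputed from the sector range in the fdisk output above (plain bash arithmetic, nothing drive-specific):

```shell
# Partition 1 spans sectors 2048..15002929151 on BOTH drives
start=2048
end=15002929151
sector_size=512
sectors=$(( end - start + 1 ))
echo "$sectors sectors"                      # 15002927104
echo "$(( sectors * sector_size )) bytes"    # 7681498677248 (~7.68 TB on both)
```

So both filesystems were created on byte-identical partitions; the difference has to come from filesystem metadata.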
`/etc/mke2fs.conf` contents are identical on both systems, so no funny business here:
================== DriveA =================
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
enable_periodic_fsck = 1
blocksize = 4096
inode_size = 256
inode_ratio = 16384
[fs_types]
ext3 = {
features = has_journal
}
ext4 = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,64bit
inode_size = 256
}
...
================== DriveB =================
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
enable_periodic_fsck = 1
blocksize = 4096
inode_size = 256
inode_ratio = 16384
[fs_types]
ext3 = {
features = has_journal
}
ext4 = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,64bit
inode_size = 256
}
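Given these defaults, mke2fs should allocate roughly one inode per `inode_ratio` (16384) bytes of filesystem. A quick sanity check against the block count reported by `tune2fs -l` (the same on both drives):

```shell
# One inode per inode_ratio bytes => inodes ~= block_count * block_size / inode_ratio
block_count=1875365888   # from tune2fs -l, identical on both drives
block_size=4096
inode_ratio=16384
echo $(( block_count * block_size / inode_ratio ))   # 468841472
```

That is within per-group rounding of Drive A's 468,844,544, while Drive B's 234,422,272 is exactly half, as if an effective ratio of 32768 had been applied there.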
- If we take a diff between the `tune2fs -l` output for both drives, we see `Inodes per group` on Drive A is 2x Drive B.
- We also see `Inode count` on Drive A is 2x Drive B (full diff HERE).
DriveA:
Inode count: 468844544
Block count: 1875365888
Reserved block count: 18753658
Free blocks: 1845578463
Free inodes: 468843793
...
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
DriveB:
Inode count: 234422272 <----- Half of A
Block count: 1875365888
Reserved block count: 18753658
Free blocks: 1860525018
Free inodes: 234422261
...
Fragments per group: 32768
Inodes per group: 4096 <---------- Half of A
Inode blocks per group: 256 <---------- Half of A
Flex block group size: 16
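The two outputs above are self-consistent with the usual ext4 layout rule `Inode count = group count × Inodes per group`, assuming the default 32768 blocks per group for a 4 KiB block size:

```shell
block_count=1875365888
blocks_per_group=32768
# Number of block groups = ceil(block_count / blocks_per_group)
groups=$(( (block_count + blocks_per_group - 1) / blocks_per_group ))
echo "$groups groups"        # 57232
echo $(( groups * 8192 ))    # 468844544 -> Drive A's Inode count
echo $(( groups * 4096 ))    # 234422272 -> Drive B's Inode count
```

Since the group count is fixed by the block count, doubling `Inodes per group` doubles `Inode count` directly.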
From How to calculate the "Inode blocks per group" on ext2 file system? I understand `Inode blocks per group` is derived from `Inodes per group`. In the mke2fs code (Source), the `s_inodes_per_group` value appears to be read in the `write_inode_tables` function only when `lazy_itable_init` is provided:
write_inode_tables(fs, lazy_itable_init, itable_zeroed);
...
static void write_inode_tables(ext2_filsys fs, int lazy_flag, int itable_zeroed)
...
if (lazy_flag)
num = ext2fs_div_ceil((fs->super->s_inodes_per_group - <--------- here
ext2fs_bg_itable_unused(fs, i)) *
EXT2_INODE_SIZE(fs->super),
EXT2_BLOCK_SIZE(fs->super));
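Plugging the superblock values into that formula (assuming `ext2fs_bg_itable_unused` returns 0 on a freshly created filesystem) reproduces the `Inode blocks per group` values from the tune2fs output:

```shell
inode_size=256
block_size=4096
# num = ceil(inodes_per_group * inode_size / block_size)
echo $(( (8192 * inode_size + block_size - 1) / block_size ))   # 512 -> Drive A
echo $(( (4096 * inode_size + block_size - 1) / block_size ))   # 256 -> Drive B
```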
If we take the difference in inode count and multiply it by the fixed inode size (256 bytes), we get (468844544 − 234422272) × 256 = 60,012,101,632 bytes ≈ 55.9 GiB of extra inode table overhead.
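The same figure falls out per block group: Drive A spends 512 − 256 = 256 extra inode-table blocks in each group (group count taken as block_count / 32768, rounded up):

```shell
groups=57232                               # ceil(1875365888 / 32768)
extra_blocks_per_group=$(( 512 - 256 ))    # Inode blocks per group: A minus B
block_size=4096
bytes=$(( groups * extra_blocks_per_group * block_size ))
echo "$bytes bytes"                        # 60012101632
echo "$(( bytes / 1073741824 )) GiB"       # 55
```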
Can anyone help me with the math on how `Inode count` increased to 2x when `Inodes per group` increased to 2x? Does `lazy_itable_init` have an impact at runtime that decides the value of `Inodes per group`? If so, how can we work out what value it will set? (This flag was the only reference to `s_inodes_per_group` I found in the code.)