My Ubuntu RAID configuration changes during a reboot and I don't understand why.
- Problem 1: The /dev/md/host:name device is not created at boot. I don't understand how, when, or why these descriptive names are created, and I'd like the behavior to be more predictable.
This array always gets its descriptive device name:
jak # mdadm --detail /dev/md127
[...]
Name : jak:neat (local to host jak)
UUID : 593fc406:87eefd53:0a076a84:f1405112
This array almost never gets its descriptive device name:
jak # mdadm --detail /dev/md130
[...]
Name : jak:sour (local to host jak)
UUID : 809a185b:a2613844:3975b412:759ec297
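As far as I can tell, the name string is whatever was written into each member's superblock when the array was created. A quick sketch of how I read it back (I'm assuming --examine on any member shows the same Name/UUID fields as --detail; /dev/sdb1 is just the first jak:neat member from the listing further down):
jak # mdadm --examine /dev/sdb1 | grep -E 'Array UUID|Name'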
My understanding is that the purpose of these human-readable names is for use in /etc/fstab:
jak # tail -2 /etc/fstab
/dev/md130 /jak/data/sour ext4 defaults 0 0
/dev/md/jak:neat /jak/data/neat ext4 defaults 0 0
You can see that for jak:sour I am mounting /dev/md130 explicitly, which is problematic because that device name occasionally changes across reboots. I also don't understand why the device number changes. What is the reliable /dev entry to use in fstab?
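For reference, one alternative I'm considering (just a sketch: it assumes the udev md rules on this box create the usual /dev/disk/by-id/md-uuid-* symlinks; the UUIDs are copied from the mdadm --detail output below):
/dev/disk/by-id/md-uuid-593fc406:87eefd53:0a076a84:f1405112 /jak/data/neat ext4 defaults 0 0
/dev/disk/by-id/md-uuid-809a185b:a2613844:3975b412:759ec297 /jak/data/sour ext4 defaults 0 0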
- Problem 2: My spare drives fall out of the arrays each reboot.
Both jak:neat and jak:sour have spare drives (full mdadm output is below). After every reboot, the spare drives vanish from the arrays. The disk devices still appear under /dev/sd* and I can re-attach them easily enough (e.g. mdadm --add /dev/md127 /dev/sdl1), but obviously I'd rather they not fall out of the arrays in the first place.
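My current guess at a fix (just a sketch, not something I've verified): make sure both arrays are listed in /etc/mdadm/mdadm.conf, then rebuild the initramfs as that file's header instructs, so early boot assembles them with their spares:
jak # mdadm --detail --scan    # should print ARRAY lines, including a spares=1 tag for each array
jak # update-initramfs -u      # after merging those ARRAY lines into /etc/mdadm/mdadm.conf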
Groveling: Googling and searching Stack Exchange for specific Linux software RAID problems is a miserable exercise in signal-to-noise. I hope these are good questions to ask.
Config details:
# uname -a
Linux jak 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
jak # mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Fri Jun 1 07:33:21 2018
Raid Level : raid10
Array Size : 23441682432 (22355.73 GiB 24004.28 GB)
Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
Raid Devices : 6
Total Devices : 7
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Oct 6 07:02:47 2019
State : clean, checking
Active Devices : 6
Working Devices : 7
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Consistency Policy : bitmap
Check Status : 18% complete
Name : jak:neat (local to host jak)
UUID : 593fc406:87eefd53:0a076a84:f1405112
Events : 201984
Number Major Minor RaidDevice State
0 8 17 0 active sync set-A /dev/sdb1
1 65 1 1 active sync set-B /dev/sdq1
2 8 81 2 active sync set-A /dev/sdf1
3 65 33 3 active sync set-B /dev/sds1
4 8 49 4 active sync set-A /dev/sdd1
6 65 17 5 active sync set-B /dev/sdr1
7 8 177 - spare /dev/sdl1
jak #
jak # mdadm --detail /dev/md130
/dev/md130:
Version : 1.2
Creation Time : Sat May 26 10:51:23 2018
Raid Level : raid6
Array Size : 39065217024 (37255.49 GiB 40002.78 GB)
Used Dev Size : 9766304256 (9313.87 GiB 10000.70 GB)
Raid Devices : 6
Total Devices : 7
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Oct 6 07:03:43 2019
State : clean, checking
Active Devices : 6
Working Devices : 7
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Check Status : 41% complete
Name : jak:sour (local to host jak)
UUID : 809a185b:a2613844:3975b412:759ec297
Events : 139649
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
2 8 129 2 active sync /dev/sdi1
3 8 145 3 active sync /dev/sdj1
4 8 161 4 active sync /dev/sdk1
6 8 241 5 active sync /dev/sdp1
7 8 209 - spare /dev/sdn1
jak #
I don't think /dev/sdn itself is suspect, but here are its SMART details anyway.
jak # smartctl -a /dev/sdn
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-65-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD100EFAX-68LHPN0
Serial Number: JEKG3TLZ
LU WWN Device Id: 5 000cca 267f04f25
Firmware Version: 83.H0A83
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Oct 6 07:04:58 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 93) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: (1116) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0004 131 131 054 Old_age Offline - 104
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
8 Seek_Time_Performance 0x0004 128 128 020 Old_age Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1148
10 Spin_Retry_Count 0x0012 100 100 060 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
22 Unknown_Attribute 0x0023 100 100 025 Pre-fail Always - 100
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 51
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 51
194 Temperature_Celsius 0x0002 144 144 000 Old_age Always - 45 (Min/Max 20/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
jak #
# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=7636d5d0:8c9d0823:7252c563:30724789 name=jak:0
ARRAY /dev/md130 metadata=1.2 UUID=809a185b:a2613844:3975b412:759ec297 name=jak:sour

# This configuration was auto-generated on Sat, 28 Apr 2018 16:34:38 -0700 by mkconf
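Note that jak:neat (md127) has no ARRAY line at all in that file. A sketch of the line I think is missing (UUID copied from the mdadm --detail output above; I'm not certain this alone explains the spares dropping out), plus the refresh the header asks for:
ARRAY /dev/md/neat metadata=1.2 UUID=593fc406:87eefd53:0a076a84:f1405112 name=jak:neat
jak # update-initramfs -u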
Remaining questions now:
- /dev/md/jak:neat and /dev/md/jak:sour do not appear. What process creates those?
- Running blkid /dev/md127 returns no output. It does return the expected output against /dev/md0. Therefore, my fstab now lists the file system devices as /dev/md127 and /dev/md130, which feels temporary (some diagnostics I ran are sketched below).
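In case it helps, the checks I'm using to see what udev actually created and to probe the device directly instead of relying on the blkid cache (a sketch; blkid -p does a low-level probe):
jak # ls -l /dev/md/ /dev/disk/by-id/md-*
jak # udevadm info --query=symlink --name=/dev/md127
jak # blkid -p /dev/md127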
Comments appreciated in advance!