Edit: I just noticed I had been writing SDD for SSD. Corrected.

I need help interpreting this situation. /dev/sda is a data disk that is backed up and holds reproducible data, so this is not system critical, but I'd like to avoid the effort of restoring or reconstructing the data, some of which would be quite time consuming.

Is recovery / repair possible?

If so, how? If I wipe the disk for re-use, how reliable will it be?

Summary (detailed reports below):

  • will not mount: bad superblock
  • badblocks finds no bad blocks
  • smartctl reports no errors
  • fsck cannot set superblock flags
  • fdisk shows clean partition
  • dmesg shows write errors
  • parted shows 792 GB free of 1 TB drive

Mounting the SSD fails like so:

 [stephen@meer ~]$ sudo mount /dev/sda1 /mnt/sda
 mount: /mnt/sda: can't read superblock on /dev/sda1.
        dmesg(1) may have more information after failed mount system call.
 [stephen@meer ~]$ 
 

but badblocks finds no bad blocks

 [root@meer stephen]# badblocks -v /dev/sda1              
 Checking blocks 0 to 976760831
 Checking for bad blocks (read-only test): done                                                 
 Pass completed, 0 bad blocks found. (0/0/0 errors)

and smartctl reports no errors either:

 [root@meer stephen]# smartctl -a /dev/sda 
 smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.9-arch1-1] (local build)
 Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Family:     WD Blue / Red / Green SSDs
 Device Model:     WDC  WDS100T2B0A-00SM50
 Serial Number:    213159800516
 LU WWN Device Id: 5 001b44 8bc4fdc6e
 Firmware Version: 415020WD
 User Capacity:    1,000,204,886,016 bytes [1.00 TB]
 Sector Size:      512 bytes logical/physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 TRIM Command:     Available, deterministic, zeroed
 Device is:        In smartctl database 7.3/5319
 ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
 SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 1.5 Gb/s)
 Local Time is:    Tue May 24 16:06:23 2022 PDT
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 
 General SMART Values:
 Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
 Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
 Total time to complete Offline 
 data collection:       (    0) seconds.
 Offline data collection
 capabilities:           (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
 SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
 Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
 Short self-test routine 
 recommended polling time:   (   2) minutes.
 Extended self-test routine
 recommended polling time:   (  10) minutes.
 
 SMART Attributes Data Structure revision number: 4
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       124
   9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       1470
  12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       134
 165 Block_Erase_Count       0x0032   100   100   ---    Old_age   Always       -       4312400063
 166 Minimum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       1
 167 Max_Bad_Blocks_per_Die  0x0032   100   100   ---    Old_age   Always       -       65
 168 Maximum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       14
 169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       630
 170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       124
 171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       128
 172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
 173 Average_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       2
 174 Unexpected_Power_Loss   0x0032   100   100   ---    Old_age   Always       -       90
 184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       64
 194 Temperature_Celsius     0x0022   070   053   ---    Old_age   Always       -       30 (Min/Max 18/53)
 199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
 230 Media_Wearout_Indicator 0x0032   001   001   ---    Old_age   Always       -       0x002600140026
 232 Available_Reservd_Space 0x0033   097   097   004    Pre-fail  Always       -       97
 233 NAND_GB_Written_TLC     0x0032   100   100   ---    Old_age   Always       -       2703
 234 NAND_GB_Written_SLC     0x0032   100   100   ---    Old_age   Always       -       2842
 241 Host_Writes_GiB         0x0030   253   253   ---    Old_age   Offline      -       466
 242 Host_Reads_GiB          0x0030   253   253   ---    Old_age   Offline      -       622
 244 Temp_Throttle_Status    0x0032   000   100   ---    Old_age   Always       -       0
 
 SMART Error Log Version: 1
 No Errors Logged
 
 SMART Self-test log structure revision number 1
 Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
 # 1  Extended offline    Completed without error       00%      1470         -
 
 Selective Self-tests/Logging not supported
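
The overall result above says PASSED, but the raw counters are easier to judge once they are pulled out of the table. Here is a minimal sketch that does that for a few rows copied from the output above. The choice of attribute IDs in WATCHED_IDS is my own assumption about which counters are worth a look, not an official list, and a nonzero value is not a verdict on the drive by itself.

```python
# Rows copied verbatim from the smartctl attribute table in this post.
SAMPLE_ROWS = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       124
169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       630
170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       124
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       128
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   070   053   ---    Old_age   Always       -       30 (Min/Max 18/53)
"""

# ASSUMPTION: these IDs are just the ones picked for this example.
WATCHED_IDS = {5, 169, 170, 171, 172, 187}


def parse_row(line):
    """Split one attribute row into (id, name, raw_value_text)."""
    # The first nine columns are whitespace separated; the raw value is the rest.
    parts = line.split(None, 9)
    return int(parts[0]), parts[1], parts[9].strip()


flagged = {}
for line in SAMPLE_ROWS.splitlines():
    attr_id, name, raw = parse_row(line)
    # Only plain integer raw values are compared; others are skipped.
    if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
        flagged[attr_id] = int(raw)
        print(f"{attr_id:>3} {name}: raw value {raw}")
```

To run this against a whole report, the same parse_row could be fed the lines between the table header and the first blank line.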
 
 

and fsck fails like so:

 [root@meer ~]# e2fsck -cfpv /dev/sda1
 /dev/sda1: recovering journal
 e2fsck: Input/output error while recovering journal of /dev/sda1
 e2fsck: unable to set superblock flags on /dev/sda1
 
 
 /dev/sda1: ********** WARNING: Filesystem still has errors **********
 
 
 
 
 
 May 24 15:38:29 meer kernel: I/O error, dev sda, sector 121899008 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
 May 24 15:38:29 meer kernel: sd 2:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 07 44 08 00 00 00 08 00
 May 24 15:38:29 meer kernel: sd 2:0:0:0: [sda] tag#31 Add. Sense: Unaligned write command
 May 24 15:38:29 meer kernel: sd 2:0:0:0: [sda] tag#31 Sense Key : Illegal Request [current] 
 May 24 15:38:29 meer kernel: sd 2:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
 May 24 15:38:29 meer kernel: ata3.00: configured for UDMA/33
 May 24 15:38:29 meer kernel: ata3.00: error: { ABRT }
 May 24 15:38:29 meer kernel: ata3.00: status: { DRDY ERR }
 May 24 15:38:29 meer kernel: ata3.00: cmd ca/00:08:00:08:44/00:00:00:00:00/e7 tag 31 dma 4096 out
                                       res 51/04:08:00:08:44/00:00:07:00:00/e7 Emask 0x1 (device error)
 May 24 15:38:29 meer kernel: ata3.00: failed command: WRITE DMA
 May 24 15:38:29 meer kernel: ata3.00: irq_stat 0x40000001
 May 24 15:38:29 meer kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
 May 24 15:38:29 meer kernel: ata3: EH complete
 May 24 15:38:29 meer kernel: ata3.00: configured for UDMA/33
 May 24 15:38:29 meer kernel: ata3.00: error: { ABRT }
 May 24 15:38:29 meer kernel: ata3.00: status: { DRDY ERR }
 May 24 15:38:29 meer kernel: ata3.00: cmd ca/00:08:00:08:44/00:00:00:00:00/e7 tag 6 dma 4096 out
                                       res 51/04:08:00:08:44/00:00:07:00:00/e7 Emask 0x1 (device error)
 May 24 15:38:29 meer kernel: ata3.00: failed command: WRITE DMA
 May 24 15:38:29 meer kernel: ata3.00: irq_stat 0x40000001
 May 24 15:38:29 meer kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
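
As a cross-check, the failing write can be decoded from the CDB bytes in the log above. This is only a sketch: it assumes the standard 10-byte WRITE(10) layout, plus the 512-byte logical sector size that smartctl reports and the partition start of 2048 that fdisk reports further down. The variable names are mine.

```python
import struct

# CDB bytes copied from the kernel log line
# "CDB: Write(10) 2a 00 07 44 08 00 00 00 08 00".
cdb = bytes.fromhex("2a 00 07 44 08 00 00 00 08 00")

# WRITE(10) layout: opcode, flags, 32-bit big-endian LBA,
# group number, 16-bit big-endian transfer length, control.
opcode, flags, lba, group, length, control = struct.unpack(">BBIBHB", cdb)

LOGICAL_SECTOR_BYTES = 512   # from "Sector Size: 512 bytes logical/physical"
PARTITION_START = 2048       # start sector of /dev/sda1 in the fdisk output

transfer_bytes = length * LOGICAL_SECTOR_BYTES
offset_in_partition = lba - PARTITION_START  # in 512-byte sectors

print(f"opcode 0x{opcode:02x}, LBA {lba}, {length} sectors ({transfer_bytes} bytes)")
print(f"{offset_in_partition} sectors into /dev/sda1")
print(f"LBA is a multiple of 8 sectors: {lba % 8 == 0}")
```

The decoded LBA matches the sector in the "I/O error, dev sda, sector 121899008" line, and the 8-sector length matches the "dma 4096 out" in the ATA lines, so every retry in the log appears to be the same 4 KiB write.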
 

Partitioning as seen by fdisk.

 Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
 Disk model: WDC  WDS100T2B0A
 Units: sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 512 bytes
 I/O size (minimum/optimal): 512 bytes / 512 bytes
 Disklabel type: gpt
 Disk identifier: 3F701164-2CF8-6D48-A94E-478634C140BE
 
 Device     Start        End    Sectors   Size Type
 /dev/sda1   2048 1953523711 1953521664 931.5G Linux filesystem

From dmesg

 [ 5292.895300] ata3.00: configured for UDMA/33
 [ 5292.895315] ata3: EH complete
 [ 5293.021851] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
 [ 5293.021859] ata3.00: irq_stat 0x40000001
 [ 5293.021864] ata3.00: failed command: WRITE DMA
 [ 5293.021866] ata3.00: cmd ca/00:08:00:08:44/00:00:00:00:00/e7 tag 18 dma 4096 out
                         res 51/04:08:00:08:44/00:00:07:00:00/e7 Emask 0x1 (device error)
 [ 5293.021874] ata3.00: status: { DRDY ERR }
 [ 5293.021877] ata3.00: error: { ABRT }

parted:

 [root@meer stephen]# parted /dev/sda
 GNU Parted 3.5
 Using /dev/sda
 Welcome to GNU Parted! Type 'help' to view a list of commands.
 (parted) print free                                                       
 Model: ATA WDC WDS100T2B0A (scsi)
 Disk /dev/sda: 1000GB
 Sector size (logical/physical): 512B/512B
 Partition Table: gpt
 Disk Flags: 
 
 Number  Start   End     Size    File system  Name  Flags
         17.4kB  1049kB  1031kB  Free Space
  1      1049kB  1000GB  1000GB  ext4
         1000GB  1000GB  729kB   Free Space
 
  • @roaima Replaced smartctl section with full output. smartctl -a edit Commented May 24, 2022 at 23:05
  • What does the partition table look like, anything in dmesg? Commented May 25, 2022 at 1:08
  • @rfmodulator added Commented May 25, 2022 at 1:28
  • "ata3.00: configured for UDMA/33": 1995 has called, they want their IDE hard drives back! This (probably) has nothing to do with your drive failure, but you should definitely configure your UEFI not to emulate an IDE interface for SATA or NVMe SSDs. Commented May 28, 2022 at 20:06
  • @MarcusMüller I had no idea I'd done that. Where did I do it? How? Commented May 29, 2022 at 2:33

1 Answer

I don't know what you've been doing with this disk, but those are crazy numbers! Looking at that output, the SSD has:

  • been powered on for 1470 hours (61 days)
  • performed 4312400063 (2.0GiB) block erases
  • done 163210068006 (76TiB) of media writes.

That's a constant 16MiB a second of writes over 61 days.
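
If you want to check those figures, here is a rough sketch of the arithmetic. It assumes the raw value of attribute 230 (0x002600140026) is one counter in units of 512 bytes. That is my reading of the number, not something the smartctl output states, and ASSUMED_BYTES_PER_UNIT is a name made up for this example.

```python
# Raw values copied from the smartctl output in the question.
POWER_ON_HOURS = 1470                # attribute 9
MEDIA_WEAROUT_RAW = 0x002600140026   # attribute 230, RAW_VALUE column

# ASSUMPTION: each unit of the raw value is one 512-byte sector written.
ASSUMED_BYTES_PER_UNIT = 512

days_on = POWER_ON_HOURS / 24
raw_decimal = int(MEDIA_WEAROUT_RAW)
bytes_written = raw_decimal * ASSUMED_BYTES_PER_UNIT
tib_written = bytes_written / 2**40
mib_per_second = bytes_written / 2**20 / (POWER_ON_HOURS * 3600)

print(f"{days_on:.1f} days powered on")
print(f"{raw_decimal} raw units = {tib_written:.1f} TiB under the assumption")
print(f"{mib_per_second:.1f} MiB/s averaged over the power-on time")
```

Under that assumption the average lands between 15 and 16 MiB/s, in line with the rough figure quoted here. If the raw value is really several packed fields, the total would be very different.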

I imagine you've got internal NAND failure. You might not be able to get your data back.

I suggest your best solution going forward is to use a RAID mirror of some form to buffer the errors between multiple disks.

Ideally, it would be two disks of different ages and/or different production batches to attempt to spread out the distribution of errors and failures between multiple disks.

Just to clarify, I consider that an abnormally high volume of writes over a very short period. You're going to need to factor that into the storage setup you go with.

  • It's a data drive for an always-on computer. It has one running VirtualBox VM and sometimes two. The VM is a torrent host that is nearly always receiving, always seeding. I keep my Dropbox on there to which I mirror my git. My Dropbox sits on three LAN hosts which contribute a small selection of logging... I suppose that would make it an active drive for a personal computer? Would that activity explain the numbers or is it likely I have some wacky process running away? Is the drive kaput? I understand SSDs have limited writes so... but smartctl seems okay. Commented May 25, 2022 at 11:35
  • Could this be the result of a general power failure? Commented May 25, 2022 at 12:04
  • It really seems as if the drive is worn out. Try to rescue the data with testdisk and then retire the drive. Commented May 25, 2022 at 12:28
  • @gerhardd Why does smartctl not detect that? Given the activity, would an HDD be a better choice? Commented May 25, 2022 at 13:06
  • Well, I guess it did detect it. Have a look at the meaning of all the counter values. For constant high-volume small-data writes, an HDD would obviously be the better solution. Commented May 25, 2022 at 15:30
