[personal profile] pvaneynd
This morning I got the bad, bad, bad email telling me my 'scratch' disk is dying.

I log in and see:
frost:~# smartctl --all /dev/sdb
...
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
Serial Number:    WD-WCAVY...
Firmware Version: 01.00A01
User Capacity:    1,500,301,910,016 bytes
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
...
SMART Error Log Version: 1
No Errors Logged


So I find (well, make) some space to copy the movies we recorded during the holidays onto the RAID-1 disks, and I start running the self-tests.

The captive tests fail and the kernel protests:

Jan 7 13:52:24 frost kernel: [9419076.780021] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 7 13:52:24 frost kernel: [9419076.780878] ata4.00: failed command: SMART
Jan 7 13:52:24 frost kernel: [9419076.781741] ata4.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0
Jan 7 13:52:24 frost kernel: [9419076.781742] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 7 13:52:24 frost kernel: [9419076.783552] ata4.00: status: { DRDY }
Jan 7 13:52:24 frost kernel: [9419076.784519] ata4: hard resetting link
Jan 7 13:52:25 frost kernel: [9419077.272012] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 7 13:52:25 frost kernel: [9419077.288319] ata4.00: configured for UDMA/133
Jan 7 13:52:25 frost kernel: [9419077.288332] ata4: EH complete
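
To see how often this happens, the link resets in a kernel log like the one above can be counted per port. This is only a minimal sketch: `ata_resets` is a hypothetical helper name, and it assumes the log lines contain the libata message "hard resetting link" in the form shown above.

```shell
# Hypothetical helper: count "hard resetting link" events per ATA port.
# Reads kernel log text on stdin and prints one "port count" line per port.
ata_resets() {
    awk '
        /ata[0-9]+: hard resetting link/ {
            # The port name is the field just before "hard", e.g. "ata4:".
            for (i = 2; i <= NF; i++)
                if ($i == "hard") {
                    port = $(i - 1)
                    sub(/:$/, "", port)   # strip the trailing colon
                    resets[port]++
                }
        }
        END { for (p in resets) print p, resets[p] }
    '
}
```

It could be fed from `dmesg | ata_resets`, or from a log file such as `/var/log/kern.log` (that path is an assumption and depends on the distribution).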


So I decide to run the tests in non-captive mode. They run to completion and, worse, smartctl then reports:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   147   021    Pre-fail  Always       -       758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7614
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7611         -
# 2  Short offline       Completed without error       00%      7606         -
# 3  Conveyance offline  Completed without error       00%      7606         -
# 4  Conveyance captive  Interrupted (host reset)      90%      7605         -
# 5  Extended captive    Interrupted (host reset)      90%      7603         -
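
The entries in a self-test log like the one above that did not finish cleanly can be pulled out with a small filter. A rough sketch, where `failed_selftests` is a made-up name and the input is assumed to be `smartctl -l selftest` (or `--all`) output in the layout shown here:

```shell
# Hypothetical helper: print self-test log entries that did not
# complete cleanly. Reads smartctl output on stdin.
failed_selftests() {
    # Log entries start with "#" and a test number; keep those whose
    # status is anything other than "Completed without error".
    awk '/^# *[0-9]+ / && !/Completed without error/'
}
```

Typical use would be `smartctl -l selftest /dev/sdb | failed_selftests`.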

So first the disk is dying, and now it is fine again???
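
One way to double-check a PASSED verdict is to compare each attribute's normalized VALUE and WORST against its THRESH by hand. The sketch below does that with awk; `attrs_at_threshold` is a hypothetical helper, and it assumes the column layout of the attribute table above and that a THRESH of 000 means the attribute has no failure threshold.

```shell
# Hypothetical helper: flag attributes whose normalized VALUE or WORST
# has reached THRESH. Reads `smartctl -A` (or `--all`) output on stdin.
# Assumption: THRESH 000 means "no failure threshold", so it is skipped.
attrs_at_threshold() {
    awk '
        # Attribute rows start with a numeric ID and have a hex FLAG in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh > 0 && (value <= thresh || worst <= thresh))
                print $2, "value=" value, "worst=" worst, "thresh=" thresh
        }
    '
}
```

Run as `smartctl -A /dev/sdb | attrs_at_threshold`. On the table pasted above it prints nothing, which matches "No failed Attributes found" and makes the earlier FAILED verdict all the more puzzling.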

I'm seriously thinking of ordering a pair of Samsung HD204UI 2TB disks so I can build a 3x 1.5T RAID5 alongside a 500M RAID 1 array, as I don't trust this disk anymore and we will need the extra space soon.

Date: 2011-01-08 12:07 am (UTC)
From: [personal profile] sweh
The SMART tests are being aborted (#4 and #5, "host reset"): at some point the disk stops responding to the kernel, the kernel resets the bus, and that reset aborts the self-test.

Disk sounds bad!
