pvaneynd: (Default)
[personal profile] pvaneynd
This morning I get the bad bad bad email telling me my 'scratch' disk is dying.

I log in and see:
frost:~# smartctl --all /dev/sdb
...
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
Serial Number:    WD-WCAVY...
Firmware Version: 01.00A01
User Capacity:    1,500,301,910,016 bytes
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
...
SMART Error Log Version: 1
No Errors Logged
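
For anyone who wants to script this check rather than wait for the email: smartctl also reports the health verdict through its exit status, which is a bit mask described in the EXIT STATUS section of smartctl(8). A minimal sketch of decoding it follows; the function name `decode_smartctl_status` is my own invention, and I am assuming the status comes from something like `subprocess.run(["smartctl", "--health", "/dev/sdb"]).returncode`.

```python
# Bit meanings paraphrased from the EXIT STATUS section of smartctl(8).
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or ATA command failed, or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "self-test log contains records of errors",
}


def decode_smartctl_status(status: int) -> list[str]:
    """Return the meaning of every bit set in a smartctl exit status.

    Hypothetical helper: it only interprets a number, it does not run smartctl.
    """
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]
```

A "FAILED!" verdict like the one above should set bit 3, so `decode_smartctl_status(8)` returns the "DISK FAILING" entry, while a status of 0 returns an empty list.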


So I make some room on the RAID-1 disks to copy the movies we recorded during the holidays, and I start running the self-tests.

The captive tests fail and the kernel protests:

Jan 7 13:52:24 frost kernel: [9419076.780021] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 7 13:52:24 frost kernel: [9419076.780878] ata4.00: failed command: SMART
Jan 7 13:52:24 frost kernel: [9419076.781741] ata4.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0
Jan 7 13:52:24 frost kernel: [9419076.781742] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 7 13:52:24 frost kernel: [9419076.783552] ata4.00: status: { DRDY }
Jan 7 13:52:24 frost kernel: [9419076.784519] ata4: hard resetting link
Jan 7 13:52:25 frost kernel: [9419077.272012] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 7 13:52:25 frost kernel: [9419077.288319] ata4.00: configured for UDMA/133
Jan 7 13:52:25 frost kernel: [9419077.288332] ata4: EH complete


So I decide to run the tests in non-captive mode. They run to completion and, worse, smartctl then outputs:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   147   021    Pre-fail  Always       -       758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7614
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7611         -
# 2  Short offline       Completed without error       00%      7606         -
# 3  Conveyance offline  Completed without error       00%      7606         -
# 4  Conveyance captive  Interrupted (host reset)      90%      7605         -
# 5  Extended captive    Interrupted (host reset)      90%      7603         -
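
To double-check the "PASSED" verdict against the attribute table, here is a rough sketch that parses attribute lines in the format shown above and lists any whose normalized VALUE is at or below THRESH, which is the condition smartctl reports as a failed attribute. The helper names are mine, and the sample is a handful of lines copied from the output above.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
"""


def parse_attributes(text):
    """Parse smartctl attribute rows into dicts; non-attribute lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # the raw value may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": fields[9].strip(),
        })
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose normalized value has reached the threshold."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


rows = parse_attributes(SAMPLE)
failing = at_or_below_threshold(rows)
```

For these rows `failing` comes out empty, which matches the "No failed Attributes found" line from the first run and makes the earlier "FAILED!" verdict all the more puzzling.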

So first the disk is dying, and now it is fine again???

I'm seriously thinking of ordering a pair of Samsung HD204UI 2TB disks to build a 3x 1.5T RAID5 plus a 500G RAID-1 array, as I don't trust this disk anymore and we will need the extra space soon.

Date: 2011-01-08 12:07 am (UTC)
sweh: (Default)
From: [personal profile] sweh
The SMART test is being aborted (#4 and #5, "host reset"): at some point the disk stops responding to the kernel, the kernel resets the bus, and that reset aborts the SMART test.

Disk sounds bad!
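
The pattern described here can be pulled out of the self-test log mechanically. This is only a sketch: the regular expression assumes the column layout shown in the post (columns separated by two or more spaces), and `parse_self_tests` is a hypothetical helper, not something smartctl provides.

```python
import re

LOG = """\
# 1  Extended offline    Completed without error       00%      7611         -
# 2  Short offline       Completed without error       00%      7606         -
# 3  Conveyance offline  Completed without error       00%      7606         -
# 4  Conveyance captive  Interrupted (host reset)      90%      7605         -
# 5  Extended captive    Interrupted (host reset)      90%      7603         -
"""

# Number, description, status, percent remaining, lifetime hours, first error LBA.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)")


def parse_self_tests(text):
    """Turn self-test log rows into dicts; lines that do not match are ignored."""
    tests = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            num, desc, status, remaining, hours, lba = m.groups()
            tests.append({
                "num": int(num),
                "description": desc,
                "status": status,
                "remaining_pct": int(remaining),
                "lifetime_hours": int(hours),
                "first_error_lba": lba,
            })
    return tests


tests = parse_self_tests(LOG)
interrupted = [t for t in tests if t["status"].startswith("Interrupted")]
```

Here `interrupted` contains only entries #4 and #5, both captive runs, while every off-line run completed.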

Date: 2011-01-08 12:20 pm (UTC)
rbarclay: (adminspotting)
From: [personal profile] rbarclay
49 degC? Really?

Date: 2011-01-08 06:40 pm (UTC)
rbarclay: (Default)
From: [personal profile] rbarclay
What does the finger test say? Really in the "ouch-ouch-ouch" region of hot? Because if it is, small wonder it's starting to fail (and some extra ventilation, if feasible, might help it along until the replacements arrive).

Date: 2011-01-08 10:32 pm (UTC)
heliumbreath: (Default)
From: [personal profile] heliumbreath
There was a thread in l'autre place a while ago about Caviar Greens; summary was that their SMART data is serious fiction and they're way too eager to spin down to save power. They appear better suited to periodic backup use than to serious server use. OTOH, I've got one ZFS RAID'ed with a Seagate, and so far so good.
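
The attribute table in the post gives a rough number for that eagerness. A back-of-the-envelope sketch, assuming the raw Power_On_Hours value really counts hours and that Load_Cycle_Count counts head load/unload cycles (related to, but not the same as, a full spin-down):

```python
def load_cycles_per_hour(load_cycle_count: int, power_on_hours: int) -> float:
    """Average head load/unload cycles per powered-on hour."""
    return load_cycle_count / power_on_hours


# Raw values from attributes 193 and 9 in the post.
rate = load_cycles_per_hour(125099, 7614)
seconds_between = 3600 / rate
print(f"{rate:.1f} load cycles per hour, one every {seconds_between:.0f} s")
# prints: 16.4 load cycles per hour, one every 219 s
```

That averages out to a head park roughly every three and a half minutes over the drive's whole life so far.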
