SMART mystery
Jan. 7th, 2011 10:17 pm
This morning I get the bad bad bad email telling me my 'scratch' disk is dying.
I log in and see:
frost:~# smartctl --all /dev/sdb
...
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
Serial Number:    WD-WCAVY...
Firmware Version: 01.00A01
User Capacity:    1,500,301,910,016 bytes
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
...
SMART Error Log Version: 1
No Errors Logged
So I make some room to copy the movies we recorded during the holidays onto the RAID-1 disks, and I start running the self-tests.
The captive tests fail and the kernel protests:
Jan 7 13:52:24 frost kernel: [9419076.780021] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 7 13:52:24 frost kernel: [9419076.780878] ata4.00: failed command: SMART
Jan 7 13:52:24 frost kernel: [9419076.781741] ata4.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0
Jan 7 13:52:24 frost kernel: [9419076.781742] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 7 13:52:24 frost kernel: [9419076.783552] ata4.00: status: { DRDY }
Jan 7 13:52:24 frost kernel: [9419076.784519] ata4: hard resetting link
Jan 7 13:52:25 frost kernel: [9419077.272012] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 7 13:52:25 frost kernel: [9419077.288319] ata4.00: configured for UDMA/133
Jan 7 13:52:25 frost kernel: [9419077.288332] ata4: EH complete
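The `cmd` line in the kernel log says which command timed out. Below is a minimal Python sketch that decodes it, assuming the field order that the Linux libata error handler prints (command/feature:nsect:lbal:lbam:lbah/...) and the ATA SMART EXECUTE OFF-LINE IMMEDIATE subcommand values. The names `FIELDS`, `SELF_TESTS` and `decode_cmd` are made up for this illustration, and the subcommand table is not exhaustive.

```python
import re

# Field order of the first six bytes on a libata "cmd" line
# (assumption: command/feature:nsect:lbal:lbam:lbah, as printed by the kernel).
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah")

# Subcommands of SMART EXECUTE OFF-LINE IMMEDIATE (feature 0xD4),
# carried in the LBA-low register. Partial table, for illustration only.
SELF_TESTS = {
    0x00: "off-line data collection",
    0x01: "short self-test (off-line)",
    0x02: "extended self-test (off-line)",
    0x03: "conveyance self-test (off-line)",
    0x7F: "abort self-test",
    0x81: "short self-test (captive)",
    0x82: "extended self-test (captive)",
    0x83: "conveyance self-test (captive)",
}


def decode_cmd(line):
    """Return (taskfile dict, self-test name or None) for a libata 'cmd' line."""
    match = re.search(r"cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})", line)
    if match is None:
        raise ValueError("no libata cmd field found")
    raw = [match.group(1)] + match.group(2).split(":")
    tf = dict(zip(FIELDS, (int(byte, 16) for byte in raw)))

    # 0xB0 is the SMART command; 0x4F/0xC2 in LBA mid/high is its signature.
    is_smart = tf["command"] == 0xB0 and (tf["lbam"], tf["lbah"]) == (0x4F, 0xC2)
    test = None
    if is_smart and tf["feature"] == 0xD4:
        test = SELF_TESTS.get(tf["lbal"], "unknown subcommand")
    return tf, test


line = "ata4.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0"
taskfile, test = decode_cmd(line)
print(test)
```

If those assumptions hold, the command that timed out is the captive conveyance self-test, which fits the "Conveyance captive ... Interrupted (host reset)" entry in the drive's self-test log: the kernel gave up waiting for the command and reset the link, and that reset interrupted the test.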
So I decide to run the tests in non-captive mode. They run to completion and, worse, smartctl then outputs:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD15EADS-00S2B0
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   147   021    Pre-fail  Always       -       758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7614
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%        7611         -
# 2  Short offline       Completed without error        00%        7606         -
# 3  Conveyance offline  Completed without error        00%        7606         -
# 4  Conveyance captive  Interrupted (host reset)       90%        7605         -
# 5  Extended captive    Interrupted (host reset)       90%        7603         -
So first the disk is dying, and now it is fine again???
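To double-check the attribute table by machine instead of by eye, here is a rough Python sketch. It assumes the ten-column layout smartctl printed above and the rule from the smartctl man page that an attribute counts as failed when its normalized value is less than or equal to its threshold. `ATTRIBUTE_ROWS` is the table from this post; `parse_attributes` is a helper name made up for this example.

```python
# Attribute rows copied from the smartctl output in this post.
ATTRIBUTE_ROWS = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   147   021    Pre-fail  Always       -       758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7614
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
"""


def parse_attributes(text):
    """Map attribute name -> normalized value, worst value, threshold, raw value."""
    attrs = {}
    for row in text.splitlines():
        cols = row.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        attrs[cols[1]] = {
            "value": int(cols[3]),
            "worst": int(cols[4]),
            "thresh": int(cols[5]),
            "raw": int(cols[9]),
        }
    return attrs


attrs = parse_attributes(ATTRIBUTE_ROWS)

# Failing now: normalized value at or below threshold.
failing_now = [name for name, a in attrs.items() if a["value"] <= a["thresh"]]
# Failed at some point in the past: worst value at or below threshold.
failed_past = [name for name, a in attrs.items() if a["worst"] <= a["thresh"]]

# Head load/unload cycles per power-on hour, from the raw values.
cycles_per_hour = attrs["Load_Cycle_Count"]["raw"] / attrs["Power_On_Hours"]["raw"]

print("failing now:", failing_now)
print("failed in the past:", failed_past)
print("load cycles per power-on hour: %.1f" % cycles_per_hour)
```

Both lists come out empty, which agrees with smartctl's "No failed Attributes found", so the table itself does not explain the earlier FAILED verdict. The only number that stands out is about 16 load cycles per power-on hour; whether that has anything to do with the warning, I can't tell from this output.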
I'm seriously thinking of ordering a pair of Samsung HD204UI 2TB disks to have a 3x 1.5T RAID 5 array plus a 500G RAID 1 array, as I don't trust this disk anymore and we will need the extra space soon.
no subject
Date: 2011-01-08 12:07 am (UTC)
Disk sounds bad!
no subject
Date: 2011-01-08 06:03 pm (UTC)
So if this is so, why does smartctl have a captive option?
no subject
Date: 2011-01-08 12:20 pm (UTC)
no subject
Date: 2011-01-08 06:06 pm (UTC)
Seems that this disk runs a little hotter than the SpinPoint T166 disks in the RAID array.
Planning to get HD204UI 2TB disks to replace the 2 500G RAID disks...
no subject
Date: 2011-01-08 06:40 pm (UTC)
no subject
Date: 2011-01-08 10:32 pm (UTC)