SMART mystery

This morning I get the bad bad bad email telling me my 'scratch' disk is dying.

I log in and see:

    frost:~# smartctl --all /dev/sdb
    ...
    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Caviar Green family
    Device Model:     WDC WD15EADS-00S2B0
    Serial Number:    WD-WCAVY...
    Firmware Version: 01.00A01
    User Capacity:    1,500,301,910,016 bytes
    ...
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    No failed Attributes found.
    ...
    SMART Error Log Version: 1
    No Errors Logged

So I ~~find~~ make some room on the RAID-1 disks for the movies we recorded during the holidays, and I start running the self-tests.

The captive tests fail and the kernel protests:

    Jan 7 13:52:24 frost kernel: [9419076.780021] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    Jan 7 13:52:24 frost kernel: [9419076.780878] ata4.00: failed command: SMART
    Jan 7 13:52:24 frost kernel: [9419076.781741] ata4.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0
    Jan 7 13:52:24 frost kernel: [9419076.781742]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jan 7 13:52:24 frost kernel: [9419076.783552] ata4.00: status: { DRDY }
    Jan 7 13:52:24 frost kernel: [9419076.784519] ata4: hard resetting link
    Jan 7 13:52:25 frost kernel: [9419077.272012] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Jan 7 13:52:25 frost kernel: [9419077.288319] ata4.00: configured for UDMA/133
    Jan 7 13:52:25 frost kernel: [9419077.288332] ata4: EH complete

So I decide to run the tests in non-captive mode. They run to completion and, worse, smartctl then reports:

    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Caviar Green family
    Device Model:     WDC WD15EADS-00S2B0
    ...
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    ...
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   253   147   021    Pre-fail  Always       -       758
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7614
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
    193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099
    194 Temperature_Celsius     0x0022   103   092   000    Old_age   Always       -       49
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error         00%             7611  -
    # 2  Short offline       Completed without error         00%             7606  -
    # 3  Conveyance offline  Completed without error         00%             7606  -
    # 4  Conveyance captive  Interrupted (host reset)        90%             7605  -
    # 5  Extended captive    Interrupted (host reset)        90%             7603  -

So first the disk is dying, and now it is fine again???
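
One way to double-check the "PASSED" is to compare each attribute's normalized VALUE against its THRESH; as far as I know, smartctl counts an attribute as failed when VALUE is at or below THRESH. Below is a minimal sketch of that check. `failing_attrs` is a made-up helper name, the column positions are assumed to match the table above, and the "bad" row is invented purely to show what a hit looks like.

```shell
#!/bin/sh
# failing_attrs (hypothetical helper): read a smartctl attribute table on
# stdin and print the name of every attribute whose normalized VALUE ($4)
# is at or below its THRESH ($6). Rows with THRESH 000 are skipped.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# A few rows copied from the table above.
sample='  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125099'

# Invented row (NOT from this disk) with VALUE below THRESH.
bad_row='  5 Reallocated_Sector_Ct   0x0033   120   120   140    Pre-fail  Always   FAILING_NOW 12'

printf '%s\n' "$sample" | failing_attrs    # prints nothing for the real rows
printf '%s\n' "$bad_row" | failing_attrs   # prints the attribute name
```

On the real rows this prints nothing, which is consistent with the "No failed Attributes found" line in the first output: the FAILED verdict did not come from any attribute crossing its threshold.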

I'm seriously thinking of ordering a pair of Samsung HD204UI 2TB disks to have a 3x 1.5T RAID 5 array with a 500M RAID 1 array, as I don't trust this disk anymore and we will need the extra space soon.

    frost:~# for i in /dev/sd? ; do echo -- $i --- ; smartctl -a $i | grep emperature ; done
    -- /dev/sda ---
    190 Airflow_Temperature_Cel 0x0022   068   052   000    Old_age   Always       -       32
    194 Temperature_Celsius     0x0022   142   094   000    Old_age   Always       -       32
    -- /dev/sdb ---
    194 Temperature_Celsius     0x0022   104   092   000    Old_age   Always       -       48
    -- /dev/sdc ---
    190 Airflow_Temperature_Cel 0x0022   067   053   000    Old_age   Always       -       33
    194 Temperature_Celsius     0x0022   139   097   000    Old_age   Always       -       33




sweh

The SMART test is being aborted (#4 and #5, "host reset"): at some point the disk stops responding to the kernel, the kernel resets the bus, and that aborts the SMART test.

Disk sounds bad!

pvaneynd

That much I understood; however, this doesn't seem abnormal, or in $WORK terms "this is expected behaviour".

So if this is so, why does smartctl have a captive option?
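
For reference in this sub-thread, here is a small sketch of how the two modes differ on the command line. `selftest_cmd` is a made-up helper that only prints the command instead of running it; `-t`, `-C` and `-l selftest` are the real smartctl options. As far as I read the man page, `-C` keeps the drive busy until the test finishes, which would fit the timeout and link reset in the kernel log, but that connection is my interpretation, not something the logs prove.

```shell
#!/bin/sh
# selftest_cmd (hypothetical helper): print the smartctl command for a
# self-test. It does not run anything; the real commands need root and a disk.
# Usage: selftest_cmd DEVICE KIND [captive]
selftest_cmd() {
    dev=$1
    kind=$2
    mode=${3:-background}
    if [ "$mode" = captive ]; then
        # -C: captive mode, the drive is tied up for the length of the test
        echo "smartctl -t $kind -C $dev"
    else
        # default: the test runs in the background while the drive serves I/O
        echo "smartctl -t $kind $dev"
    fi
}

selftest_cmd /dev/sdb long
selftest_cmd /dev/sdb long captive
# Either way, the result ends up in the self-test log:
echo "smartctl -l selftest /dev/sdb"
```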

rbarclay

49 degC? Really?

yep.

    frost:~# for i in /dev/sd? ; do echo -- $i --- ; smartctl -a $i | grep emperature ; done
    -- /dev/sda ---
    190 Airflow_Temperature_Cel 0x0022   068   052   000    Old_age   Always       -       32
    194 Temperature_Celsius     0x0022   142   094   000    Old_age   Always       -       32
    -- /dev/sdb ---
    194 Temperature_Celsius     0x0022   104   092   000    Old_age   Always       -       48
    -- /dev/sdc ---
    190 Airflow_Temperature_Cel 0x0022   067   053   000    Old_age   Always       -       33
    194 Temperature_Celsius     0x0022   139   097   000    Old_age   Always       -       33

It seems that this disk runs a little hotter than the SpinPoint T166 disks in the RAID array.
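
If only the temperature itself is wanted, the grep could be swapped for a bit of awk that picks the raw value of attribute 194. This is just a sketch: `temp_raw` is a made-up name, and it assumes the raw temperature sits in the tenth column, as it does in the output above.

```shell
#!/bin/sh
# temp_raw (hypothetical helper): print the raw value (10th column) of
# attribute 194 from smartctl attribute output on stdin.
temp_raw() {
    awk '$1 == 194 { print $10 }'
}

# The /dev/sdb line copied from the output above.
sdb_line='194 Temperature_Celsius 0x0022 104 092 000 Old_age Always - 48'
printf '%s\n' "$sdb_line" | temp_raw

# On a real machine (as root) the loop could look like this:
# for d in /dev/sd?; do printf '%s %s\n' "$d" "$(smartctl -A "$d" | temp_raw)"; done
```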

Planning to get HD204UI 2TB disks to replace the two 500G RAID disks...

What does the finger test say? Really in the "ouch-ouch-ouch" region of hot? Because if it is, small wonder it's starting to fail (and some extra ventilation, if feasible, might help it along until the replacements arrive).

heliumbreath

There was a thread in l'autre place a while ago about Caviar Greens; the summary was that their SMART data is serious fiction and they're way too eager to spin down to save power. They appear better suited to periodic backup use than to serious server use. OTOH, I've got one ZFS RAID'ed with a Seagate, and so far so good.
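
For what it's worth, the attribute table in the post lets you put a rough number on the power-saving eagerness: divide the raw Load_Cycle_Count (125099) by the raw Power_On_Hours (7614). Note that this counter tracks head load/unload cycles rather than spindle spin-downs, so linking it to the behaviour described above is an assumption. `cycles_per_hour` is a made-up helper name.

```shell
#!/bin/sh
# cycles_per_hour (hypothetical helper): average load cycles per power-on hour.
# awk does the floating-point division that plain sh arithmetic cannot.
cycles_per_hour() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%.1f\n", c / h }'
}

# Raw values of attributes 193 and 9 from the post.
cycles_per_hour 125099 7614    # prints 16.4
```

That is roughly one load cycle every three to four minutes, averaged over the drive's whole life.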
