Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE



Jeremy Chadwick wrote:
And after the reboot, the READ_DMA timeouts were back.

You're not the only one seeing this behaviour. There have been many posts
in the past reporting similar problems. Here's the breakdown:

* Some have switched to alternate operating systems (usually Linux) for
a short while and seen no sign of DMA timeouts.

Booting the 6.3-RELEASE CD seems to make the problem go away... possibly 7.0 stresses the HD more?

However: in your case, your disk does look to have problems based on the
SMART output you provided. It does not matter how new/old the disk is,
by the way. I'll point out the problematic stats. You need to replace
the disk ASAP.

Yeah, that's pretty much what I figured; the timing (i.e., the moment I boot 7.0-RELEASE) is the only bit that seems fishy. This HD has been powered on pretty much continuously for around three years. Given that it's a Maxtor, I'm honestly a bit surprised that it's lasted as well as it has.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of the cases out there, bad blocks
continue to "grow" over time, due to whatever reason (I remember reading
an article explaining it, but I can't for the life of me find the URL).

This is unusual now? I've always "known" that a small number of bad blocks is normal. Time to readjust my knowledge again?

194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48

This is excessive, and may be contributing to problems. A hard disk
running at 48C is not a good sign. This should really be somewhere
between high 20s and mid 30s.

Yeah, this is a known problem with this drive... it's been running hot for years. I always figured it was due to the rotational speed increase in commodity drives.
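
For anyone wanting to keep an eye on those two attributes, here is a rough Python sketch that pulls the raw values out of attribute rows like the ones quoted above. The sample rows are copied from this thread; the 35C cutoff is only my reading of the "mid 30s" advice, and the function names are made up for illustration. It assumes the usual ten-column attribute table layout and nothing more.

```python
# Sample rows copied from the smartctl attribute table quoted in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4
194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value_string} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


def check(attrs, max_temp_c=35):
    """Collect warnings. max_temp_c is an assumed cutoff, not a vendor limit."""
    warnings = []
    reallocated = int(attrs.get("Reallocated_Sector_Ct", "0"))
    if reallocated > 0:
        warnings.append(f"{reallocated} reallocated sectors")
    temp = int(attrs.get("Temperature_Celsius", "0"))
    if temp > max_temp_c:
        warnings.append(f"temperature {temp}C exceeds {max_temp_c}C")
    return warnings


if __name__ == "__main__":
    for warning in check(parse_attributes(SAMPLE)):
        print(warning)
```

On the two sample rows this flags both the reallocated sectors and the temperature. Other attributes can have raw values that are not plain integers, so the int() conversion is only applied to these two.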

Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.
Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of said
issues. These may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different than the ones FreeBSD
reported issues with.

If that power-on lifetime is accurate, that error was at least a year ago... but I can't find any documentation on when the power-on lifetime wraps or what it actually indicates. I'm assuming that it is total power-on time since the drive was manufactured. If it's total hours stored as a 16-bit integer, it shouldn't have wrapped yet. Is there a way of getting the "current" power-on lifetime value that you're aware of? The Power_On_Minutes attribute is interesting, but its current value is lower than the value at the error (but higher than the uptime of the system):
9 Power_On_Minutes 0x0032 219 219 000 Old_age Always - 1061h+40m
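
To sanity-check the numbers, here is the arithmetic as a small Python sketch. The counter widths are pure assumptions on my part (I have no documentation for this drive), and the helper names are invented for illustration. It only shows what the quoted values work out to and where a 16-bit counter would top out.

```python
import re

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365
MAX_16BIT = 2**16 - 1  # largest value an unsigned 16-bit counter can hold


def days_and_hours(hours):
    """Split an hour count into (days, leftover hours)."""
    return divmod(hours, HOURS_PER_DAY)


def hm_to_minutes(raw):
    """Convert a raw value such as '1061h+40m' into total minutes."""
    match = re.fullmatch(r"(\d+)h\+(\d+)m", raw)
    if match is None:
        raise ValueError(f"unexpected raw value: {raw!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


if __name__ == "__main__":
    # Lifetime reported in the SMART error log entries.
    print(days_and_hours(5171))                    # (215, 11)
    # Three years of continuous power-on, in hours, against a 16-bit limit.
    print(3 * HOURS_PER_YEAR, MAX_16BIT)           # 26280 65535
    # Current Power_On_Minutes raw value, in minutes, against the same limit.
    print(hm_to_minutes("1061h+40m"), MAX_16BIT)   # 63700 65535
```

So three years of hours fits comfortably in 16 bits, while a 16-bit count of minutes (if that is what the drive keeps, which I don't know) would run out after roughly 45 days.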

Also interesting is that after getting more errors from FreeBSD, I did not get more errors in smartctl.

My advice to you is: replace the disk ASAP. This problem will only get
worse. Try another hard disk brand too (I don't have anything "against"
Maxtor, but usually it's recommended to avoid a brand you have problems
with until the next time you have issues, then switch brands, etc.
etc...). I'm very fond of Western Digital's SE16, RE, and RE2 series
currently. But avoid Fujitsu and Samsung (both have a long track record
of having buggy drive firmwares, forcing vendors to make custom
workarounds for issues); stick with Seagate, Western Digital, or Maxtor.

Yeah, that's my plan... but I wanted to stake out some whining rights in advance so I can do the "But you said it was a bad HD or cable! Now I'm out $x00 and my system still doesn't work! Help me or I switch to DragonFly BSD/Desktop BSD/Linux which is perfect and has no problems!" thing. Then go on Slashdot and post long rambling messages about how FreeBSD is dead and it doesn't matter that the manpages on any given Linux box are useless.
