Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE



Stephen Hurd wrote:

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of the cases out there, bad blocks
continue to "grow" over time, due to whatever reason (I remember reading
an article explaining it, but I can't for the life of me find the URL).

This is unusual now? I've always "known" that a small number of bad blocks is normal. Time to readjust my knowledge again?

Modern drives hide bad sectors by keeping a pool of spare tracks and
automatically remapping bad sectors to that pool. The problem arises
once the drive has aged enough to run out of spares.
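The remapping described above can be illustrated with a toy model. This is a hypothetical sketch, not how any particular drive firmware works: `ToyDrive`, the pool size and the LBAs are made up for illustration. It shows why a handful of reallocated sectors stays invisible to the OS while spares last, and why bad sectors start surfacing as read errors once the pool is exhausted.

```python
class ToyDrive:
    """Toy model of sector remapping (hypothetical, not real firmware)."""

    def __init__(self, spares: int) -> None:
        self.spares_left = spares
        self.remap: dict[int, int] = {}        # bad LBA -> spare slot index
        self.unrecoverable: set[int] = set()   # bad LBAs with no spare left
        self._next_spare = 0

    def mark_bad(self, lba: int) -> None:
        """Record that the sector at `lba` has failed."""
        if lba in self.remap or lba in self.unrecoverable:
            return
        if self.spares_left > 0:
            # Quietly redirect the LBA to the next free spare.
            self.remap[lba] = self._next_spare
            self._next_spare += 1
            self.spares_left -= 1
        else:
            # Pool exhausted: the host will now see this bad sector.
            self.unrecoverable.add(lba)

    def reallocated_count(self) -> int:
        # Plays the role of SMART attribute 5, Reallocated_Sector_Ct.
        return len(self.remap)

    def read(self, lba: int) -> tuple[str, int]:
        """Return where the data for `lba` actually lives."""
        if lba in self.unrecoverable:
            raise OSError(f"unrecoverable read error at LBA {lba}")
        if lba in self.remap:
            return ("spare", self.remap[lba])
        return ("normal", lba)
```

With `ToyDrive(spares=4)` and five failing sectors, the first four are hidden (`reallocated_count()` returns 4 and reads succeed from spares), while reading the fifth raises an error.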


194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48

This is excessive, and may be contributing to problems. A hard disk
running at 48C is not a good sign. This should really be somewhere
between the high 20s and mid 30s.

Yeah, this is a known problem with this drive... it's been running hot for years. I always figured it was due to the rotational speed increase in commodity drives.

48C is high, but I wouldn't consider it excessive. Drives that start generating "excessive" heat tend to fail shortly thereafter. I do agree that the heat is probably shortening the lifespan on the drive.
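To keep an eye on this without reading the table by hand, the temperature row can be pulled out of `smartctl -A` output. A minimal sketch, assuming the column layout of the row quoted above (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names and the 35C cut-off (taken loosely from the "mid 30s" suggestion, not from any drive specification) are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumed cut-off based on the "mid 30s" suggestion; not a vendor spec.
SUGGESTED_MAX_C = 35


@dataclass
class SmartAttr:
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw: str


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of the attribute table printed by `smartctl -A`.

    RAW_VALUE may contain spaces, so split at most nine times and keep
    the remainder intact.
    """
    f = line.split(maxsplit=9)
    return SmartAttr(
        attr_id=int(f[0]),
        name=f[1],
        flag=int(f[2], 16),  # e.g. "0x0032"
        value=int(f[3]),
        worst=int(f[4]),
        thresh=int(f[5]),
        attr_type=f[6],
        updated=f[7],
        when_failed=f[8],
        raw=f[9],
    )


def temperature_c(attr: SmartAttr) -> int:
    """Temperature from attribute 194's raw value.

    Some drives append extra text after the number, so take the first token.
    """
    return int(attr.raw.split()[0])


def above_suggested_range(temp_c: int) -> bool:
    return temp_c > SUGGESTED_MAX_C


if __name__ == "__main__":
    row = "194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48"
    t = temperature_c(parse_attr_line(row))
    print(t, "C", "above suggested range" if above_suggested_range(t) else "ok")
```

Splitting with `maxsplit=9` matters for rows like the Power_On_Minutes one, whose raw value (`1061h+40m`) is not a plain integer.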


Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.
Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of said
issues. These may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different from the ones FreeBSD
reported issues with.

If that power-on lifetime is accurate, that was at least a year ago... but I can't find any documentation on when the power-on lifetime wraps or what it actually indicates. I'm assuming that it is the total power-on time since the drive was manufactured. If it's total hours stored as a 16-bit integer, it shouldn't have wrapped yet. Are you aware of a way to get the "current" power-on lifetime value? The Power_On_Minutes attribute is interesting, but its current value is lower than the value at the time of the error (though higher than the system's uptime):
9 Power_On_Minutes 0x0032 219 219 000 Old_age Always - 1061h+40m
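The figures above can be checked with a little arithmetic. The sketch below (helper names are made up for illustration) confirms that 5171 hours is 215 days + 11 hours, that a 16-bit hour counter would not wrap for about 7.5 years (65535 hours), and that the current Power_On_Minutes raw value of 1061h+40m is 63,700 minutes. That last number is just under 2^16, whereas 5171 hours expressed in minutes (310,260) is not, so the observation is at least consistent with Power_On_Minutes being a 16-bit minute counter that wraps roughly every 45.5 days. That is an assumption: nothing in the smartctl output above confirms how the drive stores the value.

```python
import re

U16_MAX = 2**16 - 1  # largest value an unsigned 16-bit counter can hold


def days_hours(hours: int) -> tuple[int, int]:
    """Split an hour count into (days, hours), as the SMART error log prints it."""
    return divmod(hours, 24)


def parse_hm(raw: str) -> int:
    """Convert a raw value shaped like '1061h+40m' into total minutes."""
    m = re.fullmatch(r"(\d+)h\+(\d+)m", raw)
    if m is None:
        raise ValueError(f"unexpected raw value: {raw!r}")
    return int(m.group(1)) * 60 + int(m.group(2))


def fits_u16(count: int) -> bool:
    return 0 <= count <= U16_MAX


if __name__ == "__main__":
    print(days_hours(5171))                # (215, 11)
    print(days_hours(U16_MAX))             # (2730, 15), roughly 7.5 years
    now = parse_hm("1061h+40m")            # 63700 minutes
    at_error = 5171 * 60                   # 310260 minutes
    print(now, fits_u16(now))              # 63700 True
    print(at_error, fits_u16(at_error))    # 310260 False
    print(days_hours(U16_MAX // 60))       # (45, 12): a 16-bit minute counter's span
```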

Also interesting is that after getting more errors from FreeBSD, I did not get more errors in smartctl.


The errors you're getting from FreeBSD have nothing to do directly with
SMART. The driver thinks that commands are timing out and that the
drive is becoming unresponsive. Whether they actually are is another
question. Given that this problem changes behavior with the version of
FreeBSD that you're running (and even happens in completely virtual
environments like VMware) I'm betting that it's a driver problem and not
a hardware problem, though you should probably think about migrating
your data off to a new drive sometime soon.

I'd like to attack these driver problems. What I need is to spend a
couple of days with an affected system that can reliably reproduce the
problem, instrumenting and testing the driver. I have a number of
theories about what might be going wrong, but nothing that I'm
definitely sure about. If you are willing to set up your system with
remote power and remote serial, and if we knew a reliable way to
reproduce the problem, I could probably have the problem identified and
fixed pretty quickly.

Scott
_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable


