Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE



On Wed, Feb 27, 2008 at 01:11:36AM -0800, Stephen Hurd wrote:
... The corrupted sync message scared the heck out of me:
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiti
Synncgi n(gm adxi sk6s0, svencoodnedss )r efmoari nsiynsgte.m. .pr1o0c ess
`syncer' to stop...8 7 8 3 3 3 1 0 0 0 0 done

http://lists.freebsd.org/pipermail/freebsd-current/2007-October/078145.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079130.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079131.html
http://lists.freebsd.org/pipermail/freebsd-stable/2007-December/038727.html


And after the reboot, the READ_DMA timeouts were back.

You're not the only one seeing this behaviour. There have been many
posts in the past reporting similar problems. Here's the breakdown:

* Some people reporting this problem were told to replace their ATA or
SATA cables (which had previously been working, but cables do go bad),
and this fixed the problem for a couple of them.

* Some have checked their SMART stats and found their disks to be in
perfect condition.

* Some have switched to alternate operating systems (usually Linux) for
a short while and seen no sign of DMA timeouts.

* Some have replaced the storage controller to no avail, and some have
replaced the entire motherboard to no avail. In some cases (myself
included), replacing the motherboard did in fact help.

However: in your case, your disk does look to have problems based on the
SMART output you provided. It does not matter how new/old the disk is,
by the way. I'll point out the problematic stats. You need to replace
the disk ASAP.

BTW, any SMART stats you see labelled "Offline" will not have their
numbers updated until you perform an offline test (smartctl -t short or
smartctl -t long).
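
For reference, here is a minimal Python sketch of the smartctl invocations involved. The device path /dev/ad0 is an assumption based on the subject line, and the helper name smartctl_commands is made up for illustration. It only builds the command lines and does not run anything, since starting a self-test needs root and a real disk.

```python
def smartctl_commands(device, test_type="short"):
    """Return the smartctl command lines for one self-test cycle.

    device is a path such as /dev/ad0 (assumed here, adjust to your disk).
    test_type is "short" or "long", matching smartctl's -t option.
    """
    if test_type not in ("short", "long"):
        raise ValueError("test_type must be 'short' or 'long'")
    return [
        ["smartctl", "-t", test_type, device],   # start the self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log afterwards
        ["smartctl", "-A", device],              # re-read the attribute table
    ]


# Print the commands in the order you would run them by hand.
for argv in smartctl_commands("/dev/ad0", "short"):
    print(" ".join(argv))
```

The test runs in the background on the drive itself, so the second and third commands are only meaningful once the drive has finished.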

The only "odd" thing I can think of about my system is an unusually high HZ
value (2386). I'm building a kernel now with 1000 to check if that makes a
difference.

This is not the cause, rest assured.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of the cases out there, bad blocks
continue to "grow" over time, due to whatever reason (I remember reading
an article explaining it, but I can't for the life of me find the URL).

194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48

This is excessive, and may be contributing to the problems. A hard disk
running at 48C is not a good sign. This should really be somewhere
between the high 20s and mid 30s.

195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       11498

This implies a large number of ECC (error correction) activities have
occurred, but all were successful.
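
To make the reading of these three attributes concrete, here is a rough Python sketch that parses attribute rows in the format shown above and flags the ones called out in this mail. The names SmartAttr, parse_attr_line and warnings_for are made up for illustration, and the 35C limit is just the "mid 30s" rule of thumb from this mail, not a vendor specification.

```python
from typing import NamedTuple, Optional


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    updated: str
    raw: int


def parse_attr_line(line: str) -> Optional[SmartAttr]:
    """Parse one row of the smartctl attribute table, or return None."""
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None  # header line or unrelated output
    raw_token = fields[9].split()[0]  # raw value may carry trailing notes
    if not raw_token.isdigit():
        return None
    return SmartAttr(int(fields[0]), fields[1], int(fields[3]),
                     int(fields[4]), int(fields[5]), fields[7],
                     int(raw_token))


def warnings_for(attr: SmartAttr, max_temp_c: int = 35) -> list:
    """Return human-readable warnings for one attribute."""
    notes = []
    if attr.name == "Reallocated_Sector_Ct" and attr.raw > 0:
        notes.append(f"{attr.raw} reallocated sectors")
    if attr.name == "Temperature_Celsius" and attr.raw > max_temp_c:
        notes.append(f"temperature {attr.raw}C exceeds {max_temp_c}C")
    # A threshold of 0 means the attribute can never trip.
    if attr.thresh > 0 and attr.value <= attr.thresh:
        notes.append("normalized value at or below threshold")
    return notes


# The three rows quoted in this mail.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4
194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48
195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 11498
"""

parsed = [parse_attr_line(line) for line in SAMPLE.splitlines()]
report = {a.name: warnings_for(a) for a in parsed if a is not None}
for name, notes in report.items():
    print(name, "->", "; ".join(notes) if notes else "ok")
```

Note that Hardware_ECC_Recovered is deliberately not flagged here, since the corrections it counts were successful.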

Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.
Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of said
issues. These may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different than the ones FreeBSD
reported issues with.

My advice to you is: replace the disk ASAP. This problem will only get
worse. Try another hard disk brand too (I don't have anything "against"
Maxtor, but it's usually recommended to avoid a brand you have had
problems with until the next time you have issues, then switch brands
again, etc.). I'm very fond of Western Digital's SE16, RE, and RE2
series currently. But avoid Fujitsu and Samsung (both have a long track
record of buggy drive firmware, forcing vendors to make custom
workarounds for issues); stick with Seagate, Western Digital, or Maxtor.

--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |

_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@xxxxxxxxxxx"


