Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1



On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
> I've seen mention of this kind of issue before, but I never saw a
> solution, except that someone reported that a certain version of 6.x
> seemed to make it go away - accounts of this problem are a bit vague. I
> am running 7.0-RC1, and I am seeing the errors periodically, and I am
> wondering if this is a known issue. Note that smartctl does not report
> errors logged and gives a "PASSED" to the drive. I am running at
> UDMA100 ATA. Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem. It's
very obvious when it's just one disk reporting DMA errors. You use ZFS,
so chances are you have more than one disk in a pool/volume -- there's
no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
something specific to ad0.

Manufacturers pick very lenient (non-aggressive) thresholds for error
conditions on disks, so failing disks very commonly show "PASSED"
during SMART analysis. To make matters worse, most users I know read
SMART stats incorrectly (they're easy to misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info <ata0, ata1, etc. -- any controller used by ZFS>
* Relevant dmesg output that indicates what kind of ATA controller
  these disks are attached to. Start with output from 'ad0:' and
  work backwards. For example, ad0 on this machine is using an Intel
  ICH6 controller:

    atapci0: <Intel ICH6 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
    ata0: <ATA channel 0> on atapci0
    ad0: 238475MB <WDC WD2500KS-00MJB0 02.01C03> at ata0-master SATA150
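
The "start with 'ad0:' and work backwards" step can be sketched in a few lines of Python. This is only an illustration, assuming dmesg lines shaped like the three in the example above; the function names (first_line, controller_chain) are made up for this sketch, and other controllers or drivers may format their probe lines differently.

```python
import re

def first_line(lines, device, must_contain=""):
    """Return the first line for `device` (e.g. 'ad0') that contains `must_contain`."""
    prefix = device + ":"
    for line in lines:
        if line.startswith(prefix) and must_contain in line:
            return line
    return None

def controller_chain(dmesg_text, disk="ad0"):
    """Return [disk line, channel line, controller line], as far as they can be found."""
    lines = [line.strip() for line in dmesg_text.splitlines()]
    chain = []
    # The probe line names the channel, e.g. "... at ata0-master SATA150".
    disk_line = first_line(lines, disk, " at ata")
    if disk_line is None:
        return chain
    chain.append(disk_line)
    match = re.search(r" at (ata\d+)-", disk_line)
    if match is None:
        return chain
    # The channel line names the controller, e.g. "ata0: <...> on atapci0".
    channel_line = first_line(lines, match.group(1), " on ")
    if channel_line is None:
        return chain
    chain.append(channel_line)
    match = re.search(r" on (\w+)$", channel_line)
    if match is None:
        return chain
    controller_line = first_line(lines, match.group(1))
    if controller_line is not None:
        chain.append(controller_line)
    return chain

# The three dmesg lines quoted in this message.
SAMPLE = """\
atapci0: <Intel ICH6 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
ata0: <ATA channel 0> on atapci0
ad0: 238475MB <WDC WD2500KS-00MJB0 02.01C03> at ata0-master SATA150
"""

if __name__ == "__main__":
    for line in controller_chain(SAMPLE):
        print(line)
```

Feeding it the contents of /var/run/dmesg.boot instead of SAMPLE should give the same kind of chain for a real machine, provided the probe lines follow this pattern.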

Other stuff:

SMART stats which are labelled "Offline" are only updated when a short
or long offline test is performed. Have you tried using "smartctl -t
short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
values on the far right column increment?
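
One way to check whether any raw values moved is to save the attribute table before and after the test (for example with "smartctl -A /dev/ad0") and compare the two. The sketch below assumes the usual ten-column attribute table that smartctl prints, with RAW_VALUE as the last column; the function names and the BEFORE/AFTER sample text are invented purely for illustration and are not output from any real drive.

```python
def raw_values(smartctl_text):
    """Map attribute name -> raw value string from a smartctl attribute table."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG in column 3;
        # the FLAG check keeps other numeric-looking lines (e.g. error log dumps) out.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            values[fields[1]] = " ".join(fields[9:])
    return values

def changed_raw_values(before_text, after_text):
    """Return {attribute: (raw before, raw after)} for attributes whose raw value differs."""
    before = raw_values(before_text)
    after = raw_values(after_text)
    return {
        name: (before[name], after[name])
        for name in after
        if name in before and before[name] != after[name]
    }

# Made-up sample tables, for illustration only.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""
AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       2
"""

if __name__ == "__main__":
    for name, (old, new) in changed_raw_values(BEFORE, AFTER).items():
        print(f"{name}: {old} -> {new}")
```

Raw values are compared as strings because some attributes print extra text after the number.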

Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
to see if READ/WRITE/CKSUM counters increment or if the "scrub" line
states there were errors?
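
If the pool has many devices, the counter check can be scripted. The following is a rough sketch that assumes the "zpool status" config table has a "NAME STATE READ WRITE CKSUM" header followed by one row per device; the function name and the SAMPLE text (pool "tank", devices ad0 and ad4) are hypothetical and only there to show the idea.

```python
def devices_with_errors(status_text):
    """Return {device: counters} for rows whose READ/WRITE/CKSUM are not all zero."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table or len(fields) < 5:
            continue
        counters = fields[2:5]
        # Skip lines that are not counter rows (e.g. the trailing "errors:" line).
        if not all(c[0].isdigit() for c in counters):
            continue
        # Counters stay as strings, since large values may be printed abbreviated.
        if any(c != "0" for c in counters):
            flagged[fields[0]] = dict(zip(("read", "write", "cksum"), counters))
    return flagged

# Made-up sample output, for illustration only.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad0     ONLINE       0     3     0
            ad4     ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    for device, counters in devices_with_errors(SAMPLE).items():
        print(device, counters)
```

Running it on output captured before and after a scrub shows which devices picked up new errors.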

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing the board with the same
  model, other times going with a completely different vendor, which
  implies weird implementation issues or BIOS problems)
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |

_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@xxxxxxxxxxx"


