Re: ad10: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=11441599

From: Karl Denninger (karl_at_denninger.net)
Date: 08/11/05

    Date: Wed, 10 Aug 2005 18:46:59 -0500
    To: freebsd-stable@freebsd.org
    
    

    On Thu, Aug 11, 2005 at 12:46:04AM +0200, Søren Schmidt wrote:
    >
    > On 10/08/2005, at 22:51, Karl Denninger wrote:
    > >
    > >This is the subject of the PR I filed back in February.
    > >
    > >Again, if you want either a controller shipped to you OR access to a
    > >development machine (e.g. ssh in and play) which has the suspect
    > >configuration on it, the latter of which is probably the best option
    > >(since making it fail is simple) I'm willing to provide either - my
    > >only caveat is that if I send hardware I want it back when you're
    > >done, and I believe it's reasonable to expect that 6.0 will get HELD
    > >in its release cycle until this is resolved.
    >
    > I have plenty of the sii3112's around, so that's not needed, however
    > I've not managed to get ahold of a machine in which it fails reliably
    > with ATA as is in 6.0.

    I have two which reliably fail if you put TWO disks on them in a gmirror
    config within minutes of starting a "make buildworld". With one disk
    it takes a bit longer and more effort, but can still be forced to fail.

    It appears to require a mix of read and write operations and a fairly heavy
    - but not horrifically so - I/O load to make it blow up.

    All reads or all writes do NOT fail. For example, you can do a gmirror
    rebuild and it will succeed. That's all writes (to the new disks) until
    complete. Seconds to minutes after the rebuilds complete if the system
    is under heavy random I/O load it will fail.

    From this and other tests I've concluded that a MIX of read and write
    operations is required, and the total load must be substantial. Either
    reads alone or writes alone do not appear to provoke it, even with 100%
    disk utilization.
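
    For illustration only, here is a minimal sketch of a mixed random
    read/write load of the kind described above. It is not the workload
    from this report (that was "make buildworld" on a gmirror pair); the
    function name mixed_io, the scratch path, the block size and the 50/50
    read ratio are all assumptions made for the example.

```python
import os
import random
import tempfile


def mixed_io(path, file_size=1 << 20, block=4096, ops=1000,
             read_ratio=0.5, seed=0):
    """Interleave random block reads and writes on `path`.

    Returns a (reads, writes) tuple. All parameters are illustrative.
    """
    rng = random.Random(seed)
    blocks = file_size // block

    # Pre-size the scratch file so every block offset is readable.
    with open(path, "wb") as f:
        f.truncate(file_size)

    reads = writes = 0
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(ops):
            offset = rng.randrange(blocks) * block
            if rng.random() < read_ratio:
                # Read one block at a random aligned offset.
                data = os.pread(fd, block, offset)
                assert len(data) == block
                reads += 1
            else:
                # Overwrite one block at a random aligned offset.
                os.pwrite(fd, os.urandom(block), offset)
                writes += 1
        # Push dirty pages out to the device before returning.
        os.fsync(fd)
    finally:
        os.close(fd)
    return reads, writes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        r, w = mixed_io(os.path.join(d, "scratch.bin"))
        print(f"reads={r} writes={w}")
```

    With the small defaults above, the buffer cache will absorb nearly all
    of the I/O. To put real load on a controller, the scratch file would
    need to live on the disk under test and be considerably larger than
    RAM, with several copies running at once.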

    > >The latter offer (ssh access) has been on the table for several
    > >months. The former I just put on the table as I threw up my hands
    > >and bought a 3ware card - which means I now have TWO of the suspect
    > >cards and need only one for my own testing (in the sandbox)
    > >
    > >I'm willing to go WELL out of my way to make it possible for this to
    > >get fixed, since there appears to be an issue with access to hardware
    > >that breaks reliably. However, I, and others, would like to know that
    > >we're going to see the problem get resolved.
    >
    > I've already gone WAY out of my way to try to support the sii3112,
    > and I'm not inclined to waste more of my precious spare time on it.
    > However, if it really is that important to enough people to try to
    > workaround the silicon bugs (which very likely isn't possible), get
    > together and get me failing HW on my desk and time to work on it.

    Ok, then do the RIGHT THING and document that the SiI chips are declared
    BROKEN by FreeBSD and likely to cause people trouble - including irrevocable
    data corruption.

    This would have saved me COUNTLESS hours when I first ran into this
    issue. Indeed, it was not until someone else started posting excerpts
    from commit logs (months after I filed the PR originally!) that I was
    aware FreeBSD developers considered these chipsets "damaged goods".

    Where is fair warning in the hardware compatibility guide?

    Second, your requirement for <BOTH> hardware <AND TIME> simply can't be
    met. It is not possible for anyone to manufacture or deliver time.

    Is it thus necessary for us "mere users" to consider this an issue that
    will simply not be addressed? If so, then just say so up front <AND
    DOCUMENT THAT THE SII CHIPSETS DON'T WORK RIGHT.>

    > >Again - this is hardware that is STABLE and works under 4.x - in the
    > >case of my specific configuration I ran under 4.x for over a year
    > >without a single incident. With 5.4 and 6.0-BETA I can kill it inside
    > >of 2 minutes with nothing more complicated than a "make -j4
    > >buildworld".
    >
    > First, you cannot by any stretch of the word call the sii3112
    > STABLE hardware; it's broken beyond repair or workarounds, and even
    > the supplier acknowledges that fact.

    Well then how about if FreeBSD officially DECLARES this hardware to be
    broken beyond repair and workaround, and simply says "if this doesn't work
    for you, don't bitch or complain, because we have nothing further we can
    do about it"?

    That is acceptable, although I bet it costs 'ya a fair number of users,
    particularly in the small server and workstation markets. Of course since
    it's not "money lost", that may be perfectly OK to the FreeBSD team.

    It definitely will change MY focus as a developer of software often run on
    small office and home network machines though. It HAS TO, Soren.

    This isn't a matter of me not wanting to be a FreeBSD evangelist - but
    if I try to tell people that half of the machines out there that they might
    run FreeBSD on are likely to fail, and if they do my only recommendation is
    "sorry, I can't do anything about it other than sell you this hardware", the
    obvious next reply is that they will want the software to be made available
    on an operating system that DOESN'T blow up like this. Linux ends up being
    something I have to support of necessity down that road...... (a thing I've
    studiously avoided now for five years, by the way.)

    I have a 3ware card in my production machine now and the "allegedly broken"
    disks are magically just fine. Guess the disks are fine eh? Of course I
    lost the functionality that I thought I was getting with the newer ATA code
    anyway, since the 3ware software doesn't support hot plug, and I also lost
    access to the disk statistics and self-test capabilities that smartmontools
    has, since 3ware's board doesn't pass that through cleanly either.

    But all this begs the question - why did it work on 4.x, and how come the
    same timing constraints and code paths that worked on 4.x can't / weren't
    incorporated into what's there now?

    > Second, you cannot possibly have used gmirror (as in the PR) on 4.x
    > so what was the config back then ?

    I didn't NEED gmirror back then. Attempting to use these disks on a SiI
    controller WITHOUT gmirror in 5.4 or even 6.0 is asking to have to reload
    the machine as the errors cause irrevocable data corruption.

    I'm not about to subject myself to having to reload a machine a few hundred
    times while troubleshooting it, and I suspect you know that is a completely
    unreasonable request.

    Gmirror was added to my config in an attempt to stop the crashes during
    testing - with at least one disk in the mirror on the ICH5 adapter the
    system (and data) survives. It turns out that on 5.x this is much more
    "reasonable" to use than vinum, which was severely broken in 5.x (may
    be fixed now as "gvinum", I didn't give it another crack after pulling my
    hair out for quite a long time with THAT one.)

    I assure you that the load profiles that generate BOOMs on 5.4 and 6.0-BETA
    do NOT under 4.x with the IDENTICAL hardware in use. Over a year of heavy
    production use of 4.x with ZERO trouble is my evidence for this.

    > Third, please get gmirror out of the loop (use atacontrol to create a
    > mirror if need be) and let me know if that changes anything.

    Uh, if the abstraction done by GEOM is hardware-independent, and the error
    comes from the DRIVER, how can GEOM be involved?

    GEOM (gmirror in this case) prevents me from having to reload the machine
    every time it blows up due to data corruption that cannot be fixed.

    Never mind that others are reporting irrevocable data loss and crashes -
    they aren't mirrored..... I've managed to keep my data intact....

    "atacontrol" doesn't help me as there is no rebuild mechanism available
    for "garden variety" controllers (at least the last time I tried it that
    did nothing.) So you can build the array but after the first crash you
    have no way to recover.

    That's only marginally better than having the crash wipe the sidewalk with
    the data on your drive, in terms of troubleshooting effort.

    > Fourth, another thing to try is fumbling with BIOS settings, some
    > setups have been reported to start working when PCI timings are
    > changed, YMMV..
    >
    > - Søren

    I can play with this.... but if the hardware is the cause and requires
    tweaking timing in the PCI BIOS config, how come 4.x works without any
    tweaking on the same hardware?

    In short, what's changed in the DRIVER timing that provokes this sort of
    thing, and does it NEED to have changed?

    Again, I can easily set up ssh access to a machine that has problems with
    this, and the "BOOM"s are VERY repeatable.

    From the other postings here, I am by no means an isolated user with an
    isolated problem - the issue is fairly widespread.

    I suspect, but of course cannot prove, that if you find the issue with my
    machine, you will likely fix a lot of other people's issues with similar
    problems...... I could be wrong, but I bet not......

    In any event the ATA code changes have hurt a LOT of people, Soren, and led to
    a huge amount of wasted time. If it was known that the SiI chipsets simply
    were never going to get full support (because they are considered
    "unsupportable") then it is only right for the development team to DOCUMENT
    THIS rather than letting people find out for themselves the hard way,
    pulling their hair out looking for phantom bad disk drives and phantom
    problems with cables - neither of which has anything to do with it.

    If there is going to be no path out of this mess then just say so and
    we'll realign our expectations of where FreeBSD fits in terms of what
    environments it is reasonable to consider it for.

    -- 
    Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
    http://www.denninger.net	My home on the net - links to everything I do!
    http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
    http://genesis3.blogspot.com	Musings Of A Sentient Mind
    
