Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)

From: Karl Denninger (karl_at_denninger.net)
Date: 03/31/05

    Date: Wed, 30 Mar 2005 23:00:46 -0600
    To: "Matthew N. Dodd" <mdodd@freebsd.org>
    
    

    Ok, here's what I've got so far.

    Pulling the SECOND delta gets rid of the stability problem, but it also
    gets rid of the requeue fix (i.e. removing it defeats the essential
    purpose of the deltas in the first place.)
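
    For reference, the effect of the second delta's changed test can be
    modeled in user space. This is only a minimal sketch, not driver code:
    the struct and function names (fake_device, fake_request, reinject_old,
    reinject_new) are hypothetical stand-ins, and it assumes ata_reinit()
    returns 0 on success, which is inferred from the comment in the 1.32.2.6
    side of the quoted diff rather than checked against the source.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for the driver structures; only the fields
     * that the reinject test reads are modeled here. */
    struct fake_device {
        void *param;                /* NULL once the device has gone away */
    };

    struct fake_request {
        int retries;                /* retries still permitted */
        struct fake_device *device;
    };

    /* Test as written in revision 1.32.2.5: reinject only when the
     * reinit return code is non-zero. */
    static bool reinject_old(int reinit_rc, struct fake_request *r)
    {
        return reinit_rc && r->retries-- > 0 && r->device->param;
    }

    /* Test as written in revision 1.32.2.6: reinject only when the
     * reinit return code is zero (assumed here to mean success). */
    static bool reinject_new(int reinit_rc, struct fake_request *r)
    {
        return !reinit_rc && r->retries-- > 0 && r->device->param;
    }
    ```

    Under that assumption, the old test never reinjects after a successful
    reinit, which would match the immediate disconnect from the array seen
    with the second delta removed. Because of short-circuit evaluation, the
    retry counter is only decremented when the reinit part of the test
    passes.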

    Removing the FIRST delta, which is:

    218a219,221
           if (!dumping)
               callout_reset(&request->callout, request->timeout * hz,
                             (timeout_t*)ata_timeout, request);

    appears to get rid of the crashes while not harming data integrity OR the
    requeueing.

    With this one out, the errors still occur (I was able to generate over a
    dozen retries in less than 10 minutes doing a large file copy with a
    3-disk RAID 1 array made up of 2 SATA disks and 1 UDMA100 disk), BUT they
    are retried (apparently successfully.)

    I copied the source tree to /usr/src2 and took the errors. I am now
    attempting to "buildworld" off it - so far, so good (about 1/4 of the way
    through - if there was data corruption it should have failed by now)

    Also, the sandbox system is still up. That also is a major improvement.

    I will let this buildworld complete, and if it is successful (proving that
    the retried errors didn't actually result in corrupted files!), will put
    this same change (pulling the first delta only) on the production system,
    rebuild the other RAID disks (I had to pull the cartridges from there to
    use them on the sandbox) and see if intentionally provoking the same
    error there allows the system to remain stable once the errors start
    showing up.

    Again, I will not have a "final" determination on this until late tomorrow,
    but at first blush pulling the first delta appears to fix the stability
    issue.

    Further update tomorrow as soon as I have it....

    --
    -- 
    Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
    http://www.denninger.net	My home on the net - links to everything I do!
    http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
    http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
    http://genesis3.blogspot.com	Musings Of A Sentient Mind
    On Wed, Mar 30, 2005 at 09:08:30PM -0600, Karl Denninger wrote:
    > On Tue, Mar 29, 2005 at 11:43:18PM -0600, Karl Denninger wrote:
    > > Here's the diff and some thoughts....
    > > 
    > > Fs:/usr/src/sys/dev/ata> cvs diff -r 1.32.2.5 ata-queue.c
    > > Index: ata-queue.c
    > > ===================================================================
    > > RCS file: /usr/cvs/src/sys/dev/ata/ata-queue.c,v
    > > retrieving revision 1.32.2.5
    > > retrieving revision 1.32.2.6
    > > diff -r1.32.2.5 -r1.32.2.6
    > > 30c30
    > > < __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.5 2004/10/24 09:27:37 sos Exp $");
    > > ---
    > > > __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.6 2005/03/23 04:50:26 mdodd Exp $");
    > > 218a219,221
    > > >       if (!dumping)
    > > >           callout_reset(&request->callout, request->timeout * hz,
    > > >                         (timeout_t*)ata_timeout, request);
    > > 241,243c244,249
    > > < 
    > > <       /* if reinit succeeded and retries still permit, reinject request */
    > > <       if (ata_reinit(ch) && request->retries-- > 0 && request->device->param){
    > > ---
    > > >       /*
    > > >        * if reinit succeeds, retries still permit and device didn't
    > > >        * get removed by the reinit, reinject request
    > > >        */
    > > >       if (!ata_reinit(ch) && request->retries-- > 0
    > > >           && request->device->param){
    > > 245a252
    > > >           request->donecount = 0;
    > 
    > Removing the second change (changing the test on the "ata_reinit") appears to 
    > prevent both the destabilization and the actual requeue from taking place 
    > (that is, you get the immediate disconnect from the array when the error 
    > occurs; therefore whatever is causing the destabilization doesn't happen.)
    > 
    > I will attempt to remove the first delta alone (and put back the second), but 
    > from a quick perusal of the code I doubt this will make a material change.
    > 
    _______________________________________________
    freebsd-stable@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-stable
    To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
    
