Re: stray irq13 at runtime

From: Bruce Evans (bde_at_zeta.org.au)
Date: 05/30/04

  • Next message: Danny Braniss: "Re: /usr/local/etc/rc.conf"
    Date: Sun, 30 May 2004 21:44:34 +1000 (EST)
    To: Kris Kennaway <kris@obsecurity.org>
    
    

    On Sat, 29 May 2004, Kris Kennaway wrote:

    > Since updating the i386 package machines the other day, they've all
    > experienced the following:
    >
    > May 29 21:24:53 <user.err> gohan28 kernel: stray irq13
    >
    > irq13: npx0 2 0
    > stray irq13 1 0
    >
    > This is not appearing during boot - those machines have been up for
    > hours before the interrupt occurs.

    This is probably harmless.

    There's some bug in APIC mode that causes a stray irq13 to be delivered
    earlier on my systems. Perhaps you are getting this same stray irq13
    delivered later. You also have an extra non-stray irq13. There should
    be exactly 1 irq13 delivered ever, except on 386 and 486SX systems,
    where applications can generate any number.

    I debugged some of this. APIC mode seems to behave differently because:
    (1) the APIC responds much more slowly than the PIC (after 2349 instead
        of 57 iterations in the enclosed debugging code on an Athlon XP2600)
    (2) the not-so-new interrupt code broke the hack that prevented getting
        interrupts after bus_teardown_intr(). These are reported as stray
        interrupts. There was a completely different bug (non-atomic update
        of the interrupt name and/or count pointers) which caused non-stray
        interrupts (always for npx, and possibly for others) to be
        reported as stray, so the hack hasn't helped for a year or two if
        it ever did.

    npx_probe() tests whether exceptions are reported by traps or interrupts
    by causing an unmasked exception and checking whether this causes a trap
    or interrupt. Normally when there's a trap there is an interrupt too.
    Traps occur synchronously, but interrupts occur asynchronously, especially
    since we don't synchronize with the FPU^WNPX. We do an fnop after dividing
    by 0 to trigger reporting the exception. The NPX and CPU continue
    asynchronously. Thus we have a race. The size of the race window is
    apparently related to [A]PIC hardware, so it has become large enough
    relative to CPU speeds to cause problems on fast CPUs with high-latency
    [A]PICs. OTOH, we can easily synchronize better using fwait instead of
    fnop. The reasons for using fnop instead of fwait (only FUD?) don't seem
    to apply any more. Changing from fnop to fwait gets the interrupt delivered
    after 49 iterations instead of 2349 (in the enclosed debugging code). This
    is still much longer than I'd like. 49 iterations is still over 100 cycles,
    and there are hundreds more cycles between the fwait and the delivery of
    the irq13 for trap and interrupt handling. Something must wait for irq13
    delivery so that irq13's don't get seen by the wrong thread (if they are
    used at all), but other parts of npx.c don't even know if they might have
    to wait.
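
    The race described above can be illustrated with a toy user-space model.
    This is only a sketch under stated assumptions, not the kernel code:
    struct fake_irq, fake_irq_tick() and polls_until_seen() are hypothetical
    names, and "latency" is simply the number of polls before the simulated
    interrupt is delivered (57, 2349 and 49 in the measurements quoted above).

```c
/*
 * Toy user-space model of the probe race; not the kernel code.
 * All names here are hypothetical and exist only for illustration.
 */
#include <stdbool.h>

struct fake_irq {
	long latency;	/* polls between the trigger and delivery */
	long elapsed;	/* polls seen so far */
	bool pending;	/* set once the "interrupt" has been delivered */
};

/* Advance the model by one poll; deliver once the latency has elapsed. */
static void
fake_irq_tick(struct fake_irq *irq)
{
	if (++irq->elapsed >= irq->latency)
		irq->pending = true;
}

/*
 * Spin for at most max_iter polls.  Returns the number of polls after
 * which the interrupt was seen, or -1 if the budget ran out first.  In
 * the -1 case a later delivery would find no handler installed, which
 * is the situation reported as a stray interrupt.
 */
long
polls_until_seen(struct fake_irq *irq, long max_iter)
{
	long i;

	for (i = 0; i < max_iter; i++) {
		if (irq->pending)
			return (i);
		fake_irq_tick(irq);
	}
	return (-1);
}
```

    In this model a source with latency 57 is seen after 57 polls, while a
    source with latency 2349 polled with a budget of only 100 returns -1. The
    point is only that a fixed, short wait that was adequate for a
    low-latency controller can expire before a high-latency one delivers.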

    Fixes and debugging code:

    % Index: npx.c
    % ===================================================================
    % RCS file: /home/ncvs/src/sys/i386/isa/npx.c,v
    % retrieving revision 1.148
    % diff -u -2 -r1.148 npx.c
    % --- npx.c 11 May 2004 20:14:53 -0000 1.148
    % +++ npx.c 30 May 2004 10:39:19 -0000
    % @@ -105,5 +105,5 @@
    % #define fnstcw(addr) __asm __volatile("fnstcw %0" : "=m" (*(addr)))
    % #define fnstsw(addr) __asm __volatile("fnstsw %0" : "=m" (*(addr)))
    % -#define fp_divide_by_0() __asm("fldz; fld1; fdiv %st,%st(1); fnop")
    % +#define fp_divide_by_0() __asm("fldz; fld1; fdiv %st,%st(1); fwait")
    % #define frstor(addr) __asm("frstor %0" : : "m" (*(addr)))
    % #ifdef CPU_ENABLE_SSE

    This changes from fnop to fwait, to synchronize better. See above.

    % @@ -369,4 +369,19 @@
    % 	npx_traps_while_probing = npx_intrs_while_probing = 0;
    % 	fp_divide_by_0();
    % +#ifdef DEBUG
    % +	{
    % +		int i;
    % +
    % +		for (i = 0; i < 10000000; i++)
    % +			if (npx_intrs_while_probing != 0) {
    % +				device_printf(dev,
    % +				    "saw intr after %d iterations\n",
    % +				    i);
    % +				break;
    % +			}
    % +	}

    This determines latency of irq13 delivery.

    % +#else
    % + DELAY(1000); /* wait for any IRQ13 */
    % +#endif

    Waiting this long should always work.

    % if (npx_traps_while_probing != 0) {
    % /*
    % @@ -407,4 +422,5 @@
    % bus_teardown_intr(dev, irq_res, irq_cookie);
    %
    % +#if 0
    % /*
    % * XXX hack around brokenness of bus_teardown_intr(). If we left the
    % @@ -417,4 +433,5 @@
    % isrc->is_pic->pic_disable_source(isrc);
    % }
    % +#endif

    bus_teardown_intr() still doesn't disable the interrupt, at least in the
    edge-triggered case, but neither does this hack (in either the PIC or APIC
    case), since isrc->is_pic->pic_disable_source() is a no-op for
    edge-triggered interrupts and irq13 is normally edge-triggered.

    %
    % bus_release_resource(dev, SYS_RES_IRQ, irq_rid, irq_res);
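
    The remark above about pic_disable_source() being a no-op for
    edge-triggered interrupts can be pictured with a small model. This is a
    hedged sketch, not FreeBSD's implementation: enum trigger, struct
    fake_source, fake_disable_source() and would_report_stray() are all
    hypothetical, and the only behaviour modelled is the one stated in the
    text (edge-triggered sources are left unmasked).

```c
/*
 * Toy model of "disable is a no-op for edge-triggered sources".
 * Hypothetical names throughout; this mirrors only the behaviour
 * described in the surrounding text, not the real interrupt code.
 */
#include <stdbool.h>

enum trigger { TRIG_EDGE, TRIG_LEVEL };

struct fake_source {
	enum trigger trig;	/* how the source signals */
	bool masked;		/* true once the source has been masked */
};

/* Mask level-triggered sources only; edge-triggered ones are untouched. */
void
fake_disable_source(struct fake_source *src)
{
	if (src->trig == TRIG_LEVEL)
		src->masked = true;
}

/*
 * An interrupt that arrives on an unmasked source with no handler
 * installed is what gets reported as stray in this model.
 */
bool
would_report_stray(const struct fake_source *src, bool handler_installed)
{
	return (!src->masked && !handler_installed);
}
```

    With an edge-triggered source, calling fake_disable_source() after the
    handler has been removed changes nothing, so would_report_stray() still
    returns true for a late delivery. For a level-triggered source it
    returns false.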

    I haven't figured out why the APIC case normally delivers both a normal
    (fast) interrupt and a stray interrupt when we don't wait for the one
    interrupt that actually occurs. One is counted as stray because it
    occurs after the bus_teardown_intr(), but both of them seem to occur
    after that. So there seems to be a race or double counting somewhere.

    Bruce
    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

