Re: Problem with twa in HEAD

From: Scott Long (scottl_at_samsco.org)
Date: 04/29/05

    Date: Fri, 29 Apr 2005 00:57:26 -0600
    To: Vinod Kashyap <vkashyap@amcc.com>
    
    

    Vinod Kashyap wrote:
    >
    >>-----Original Message-----
    >>From: Bjoern A. Zeeb [mailto:bz@FreeBSD.org]
    >>Sent: Tuesday, April 26, 2005 3:26 AM
    >>To: Vinod Kashyap
    >>Subject: RE: Problem with twa in HEAD
    >>
    >>
    >>On Mon, 25 Apr 2005, Vinod Kashyap wrote:
    >>
    >>Hi,
    >>
    >>
    >>>>-----Original Message-----
    >>>>From: Bjoern A. Zeeb [mailto:bz@FreeBSD.org]
    >>>>Sent: Monday, April 25, 2005 6:45 AM
    >>>>To: Vinod Kashyap
    >>>>Subject: Re: Problem with twa in HEAD
    >>>>
    >>>>
    >>>>On Fri, 22 Apr 2005, Bjoern A. Zeeb wrote:
    >>>>
    >>>>Hi,
    >>>>
    >>>>
    >>>>>scottl redirected me to you.
    >>>>>
    >>>>>I am currently debugging "hangs" on reboot and shutdown on a
    >>>>>SMP machine with 12 discs at a
    >>>>>
    >>>>>3ware device driver for 9000 series storage controllers, version: 3.60.00.016
    >>>>>twa0: <3ware 9000 series Storage Controller> port 0x9800-0x98ff mem 0xfe8ffc00-0xfe8ffcff,0xfb800000-0xfbffffff irq 28 at device 6.0 on pci3
    >>>>>twa0: [FAST]
    >>>>>twa0: INFO: (0x15: 0x1300): Controller details:: 12 ports, Firmware FE9X 2.06.00.009, BIOS BE9X 2.03.01.051
    >>>>>
    >>>>>What I know so far is that Giant is held by sync.
    >>>>>
    >>>>>Things are "spinning" in cam/cam_xpt.c around:
    >>>>>
    >>>>>--- cam_xpt.c 31 Mar 2005 21:42:49 -0000 1.152
    >>>>>+++ cam_xpt.c 22 Apr 2005 18:42:43 -0000
    >>>>>@@ -3643,6 +3643,7 @@ xpt_polled_action(union ccb *start_ccb)
    >>>>> != CAM_REQ_INPROG)
    >>>>> break;
    >>>>> DELAY(1000);
    >>>>>+ printf("XXX status=%02x\n", start_ccb->ccb_h.status);
    >>>>> }
    >>>>> if (timeout == 0) {
    >>>>> /*
    >>>>>
    >>>>>
    >>>>>with status being 0x200.
    >>>>>
    >>>>>Seems the twa has a command stuck in it.
    >>>>>
    >>>>>I have seen the comment in dev/twa/tw_osl_cam.c ~ line 253 about
    >>>>>queuing and CAM_SIM_QUEUED but I don't know enough about cam.
    >>>>>It seems not all paths out of this function clear that from
    >>>>>status?
    >>>>>
    >>>>>Any help appreciated ;) I can try patches, as long as I can break
    >>>>>to db> to reboot.
    >>>>
    >>>>Further debugging shows that it seems to be spinning in twa_poll.
    >>>>See debug output from TWA_DEBUG 3. The problem is that at this point
    >>>>I am no longer able to break to debugger.
    >>>>
    >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
    >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
    >>>>unmount of /dev failed (BUSY)
    >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
    >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
    >>>>Uptime: 2m57s
    >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>twa0: twa_poll: entering; sc = 0xc57bb200
    >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
    >>>>...
    >>>>
    >>>
    >>>I am in the middle of an office move right now.
    >>>I will get back to you once I have some time to look into this.
    >>
    >>
    >>thanks for the information; I'll be able to test at least until end of
    >>this week and hopefully next week too.
    >>
    >
    >
    > I looked into this, and this is what is happening:
    > On reboot/halt, the following function calling sequence happens:
    > ... --> dashutdown --> xpt_polled_action --> twa_poll.
    > But, the interrupt handler in twa is still active at this time,
    > since twa_detach/twa_shutdown hasn't been called yet. Before
    > twa_poll can fetch the response for the posted command, the ISR
    > gets called when the firmware posts the response. The ISR clears
    > the interrupt bit on the controller, registers a taskqueue handler like
    > it always does, and exits. Meanwhile, xpt_polled_action continues
    > to call twa_poll, which cannot determine that the command has completed,
    > since the interrupt bit on the controller is already cleared. So,
    > we get into a (near) never-ending loop (the timeout for scsi_synchronize_cache,
    > which is what is being tried here, is, for whatever reason, 60 minutes,
    > and so, the system is as good as hung).
    >
    > Now, does anyone know why xpt_polled_action is being called from
    > dashutdown, even before the ISR has been unregistered (via twa_detach)?
    >
    > Bjoern, this patch should work around your problem, although it's not
    > the fix. Also, it still leaves a window for the race condition described
    > above.
    >

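The hang described in the quoted analysis can be modelled in a few lines of userland C. This is only a sketch under stated assumptions, not driver code: `struct fake_ctlr`, `fw_post_response()`, `fake_isr()`, `fake_poll()` and `polled_wait()` are hypothetical names, and the model assumes that the task queued by the ISR never gets to run while the shutdown path is polling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for controller/driver state; not the real twa softc. */
struct fake_ctlr {
	bool intr_bit;         /* interrupt status bit on the controller */
	bool response_posted;  /* firmware has posted a response */
	bool cmd_done;         /* driver has completed the command */
};

/* Firmware finishes the command: post the response, raise the interrupt. */
static void
fw_post_response(struct fake_ctlr *c)
{
	c->response_posted = true;
	c->intr_bit = true;
}

/*
 * ISR as described in the analysis: clear the interrupt bit and hand off
 * to a taskqueue. Assumption: the queued task never runs while the
 * shutdown path is spinning, so nothing completes the command here.
 */
static void
fake_isr(struct fake_ctlr *c)
{
	c->intr_bit = false;
}

/* Poll routine that only looks for work when the interrupt bit is set. */
static void
fake_poll(struct fake_ctlr *c)
{
	if (!c->intr_bit)
		return;		/* bit already cleared by the ISR: nothing seen */
	c->intr_bit = false;
	if (c->response_posted) {
		c->response_posted = false;
		c->cmd_done = true;
	}
}

/*
 * Polled wait in the style of the xpt_polled_action() loop quoted earlier.
 * Returns the number of poll calls needed, or -1 if the timeout expires.
 */
static long
polled_wait(struct fake_ctlr *c, long timeout_ms, bool isr_runs_first)
{
	assert(timeout_ms > 0);
	fw_post_response(c);
	if (isr_runs_first)
		fake_isr(c);	/* the ISR wins the race against the poll */
	for (long i = 0; i < timeout_ms; i++) {
		fake_poll(c);
		if (c->cmd_done)
			return (i + 1);
	}
	return (-1);
}
```

When `isr_runs_first` is false, `polled_wait()` completes on the first poll. When it is true, the response stays posted but unseen and the loop runs to the timeout. At one poll per millisecond (the `DELAY(1000)` in the quoted loop), a 60-minute timeout amounts to 3,600,000 iterations.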
    xpt_polled_action() expects that it can simulate interrupts by calling
    the driver poll vector, and that by calling it enough times the driver
    will eventually complete all the outstanding I/O it has. As you note,
    it'll repeat this for a very long time. So the question is then why the
    twa driver isn't completing the outstanding I/O. If I were you I'd
    remove the call to tw_cl_interrupt() in twa_poll() and just
    unconditionally call tw_cl_deferred_interrupt() and have it check
    everything. The locking here (and in twa_pci_intr()) is flawed anyway;
    you have a race between when tw_cl_interrupt() drops its lock right
    before returning and when you check its return value. I'd like to say that
    it's harmless, except that you expect to pass state from one function to
    the next, so the race is a real one. It's likely why this case is
    failing. An ideal FAST handler should only clear the hardware interrupt
    register and launch the appropriate handlers, it shouldn't try to pass
    state to the handlers. Look at aac for an example here, but also please
    recall that I've already discouraged you from using a fast handler
    plus taskqueue for this driver. If your taskqueue handlers need state
    from when the interrupt was cleared, then they simply aren't a good
    candidate for this model.
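
The shape of that suggestion can be sketched in a userland toy model. Everything named here is hypothetical (`struct sim_ctlr`, `sim_fast_intr()`, `sim_deferred_intr()`, `sim_poll()`, `sim_polled_wait()`); the functions only stand in for the roles of the fast handler, the deferred handler and the poll vector, and do not reflect the actual twa code. The point illustrated is that the deferred handler inspects the response queue itself, so it does not matter who cleared the interrupt bit first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for controller/driver state. */
struct sim_ctlr {
	bool intr_bit;           /* interrupt status bit on the controller */
	int  responses_pending;  /* responses posted by the firmware */
	int  cmds_completed;     /* commands the driver has completed */
};

/* Fast handler: acknowledge the hardware only; passes no state onward. */
static void
sim_fast_intr(struct sim_ctlr *c)
{
	c->intr_bit = false;
}

/*
 * Deferred handler: drains whatever is in the response queue. It does
 * not consult the interrupt bit or any state left by the fast handler.
 */
static void
sim_deferred_intr(struct sim_ctlr *c)
{
	while (c->responses_pending > 0) {
		c->responses_pending--;
		c->cmds_completed++;
	}
}

/* Poll vector: unconditionally run the deferred handler. */
static void
sim_poll(struct sim_ctlr *c)
{
	sim_deferred_intr(c);
}

/*
 * Polled wait with the fast handler always winning the race.
 * Returns the number of poll calls needed, or -1 on timeout.
 */
static long
sim_polled_wait(struct sim_ctlr *c, long timeout_ms)
{
	assert(timeout_ms > 0);
	c->responses_pending++;		/* firmware posts a response */
	c->intr_bit = true;
	sim_fast_intr(c);		/* interrupt bit is cleared early */
	for (long i = 0; i < timeout_ms; i++) {
		sim_poll(c);
		if (c->cmds_completed > 0)
			return (i + 1);
	}
	return (-1);
}
```

In this model the command completes on the first poll even though the interrupt bit was cleared beforehand, because completion is derived from the queue contents rather than from interrupt state.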

    Scott
    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

