Re: aac(4) handling of probe when no devices are there



On Mon, Dec 14, 2009 at 4:47 PM, Alexander Sack <pisymbol@xxxxxxxxx> wrote:
Hello Again:

I have a technical question/concern on which I am looking for
feedback.  During the probe sequence, aac(4) conditionally responds
to INQUIRY commands depending on the target LUN:

aac_cam.c/aac_cam_complete():
        if (command == INQUIRY) {
                if (ccb->ccb_h.status == CAM_REQ_CMP) {
                        device = ccb->csio.data_ptr[0] & 0x1f;
                        /*
                         * We want DASD and PROC devices to only be
                         * visible through the pass device.
                         */
                        if ((device == T_DIRECT) ||
                            (device == T_PROCESSOR) ||
                            (sc->flags & AAC_FLAGS_CAM_PASSONLY))
                                ccb->csio.data_ptr[0] =
                                    ((device & 0xe0) | T_NODEVICE);
                } else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
                    ccb->ccb_h.target_lun != 0) {
                        /* fix for INQUIRYs on Lun>0 */
                        ccb->ccb_h.status = CAM_DEV_NOT_THERE;
                }
        }
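
To make the LUN-dependent behaviour easier to see in isolation, here is a
minimal stand-alone sketch of just the status remapping.  The enum values
and the function name are hypothetical stand-ins chosen for illustration;
they are not the real CAM definitions, and the sketch ignores the
data_ptr[0] rewrite entirely.

```c
/*
 * Stand-in status codes for illustration only.  The real codes are
 * defined by CAM and have different numeric values.
 */
enum model_status {
        MODEL_REQ_CMP,
        MODEL_SEL_TIMEOUT,
        MODEL_DEV_NOT_THERE
};

/*
 * Models only the status remapping applied to INQUIRY completions:
 * a selection timeout becomes "device not there" for LUN > 0 and is
 * passed through unchanged for LUN 0.  Every other status is left alone.
 */
static enum model_status
model_remap_inquiry_status(enum model_status status, unsigned int lun)
{
        if (status == MODEL_SEL_TIMEOUT && lun != 0)
                return (MODEL_DEV_NOT_THERE);
        return (status);
}
```

Calling model_remap_inquiry_status(MODEL_SEL_TIMEOUT, 0) returns
MODEL_SEL_TIMEOUT, while the same call with lun set to 1 returns
MODEL_DEV_NOT_THERE, which is the asymmetry in question.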

Why is CAM_DEV_NOT_THERE skipped on LUN 0?  This is true on my target
6.1-amd64 machine as well as on CURRENT.  I ask because, now that
aac(4) is scanned sequentially, a lot of CAM interrupts come in on my
6.x machine, where the threshold is only 500, and I get the interrupt
storm threshold warning for swi2 pretty quickly:

Interrupt storm detected on "swi2:"; throttling interrupt source

Obviously it's contingent on the number of adapters you have in your
system.  On CURRENT I didn't see this because the threshold is double
(I think it's 1000 by default).
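
As a rough back-of-the-envelope sketch of why the adapter count matters:
the struct, the function names and any numbers fed into them are
hypothetical, and treating the storm threshold as a plain event count is a
deliberate simplification of how the kernel actually detects a storm.

```c
/*
 * Hypothetical description of a scan.  Nothing here is measured; the
 * fields only exist to make the arithmetic explicit.
 */
struct model_scan {
        unsigned int adapters;                  /* number of adapters */
        unsigned int failed_probes_per_adapter; /* INQUIRYs that find nothing */
};

/* Total failed probes, assuming each one raises one software interrupt. */
static unsigned int
model_total_failed_probes(const struct model_scan *scan)
{
        return (scan->adapters * scan->failed_probes_per_adapter);
}

/* Simplified check: does the total exceed the given threshold? */
static int
model_exceeds_threshold(const struct model_scan *scan, unsigned int threshold)
{
        return (model_total_failed_probes(scan) > threshold);
}
```

With made-up inputs of 4 adapters and 150 failed probes each, the total of
600 is above a threshold of 500 but below one of 1000, which is the kind of
difference I would expect between 6.x and CURRENT.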

The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
the scan.  The probe sequence in CURRENT as well as 6.1 handles
CAM_SEL_TIMEOUT a little differently depending on context.

scsi_xpt.c/probedone():
        } else if (cam_periph_error(done_ccb, 0,
                                    done_ccb->ccb_h.target_lun > 0
                                    ? SF_RETRY_UA|SF_QUIET_IR
                                    : SF_RETRY_UA,
                                    &softc->saved_ccb) == ERESTART) {
                return;
        } else if ((done_ccb->ccb_h.status & CAM_DEV_QFRZN) != 0) {
                /* Don't wedge the queue */
                xpt_release_devq(done_ccb->ccb_h.path, /*count*/1,
                                 /*run_queue*/TRUE);
        }
        /*
         * If we get to this point, we got an error status back
         * from the inquiry and the error status doesn't require
         * automatically retrying the command.  Therefore, the
         * inquiry failed.  If we had inquiry information before
         * for this device, but this latest inquiry command failed,
         * the device has probably gone away.  If this device isn't
         * already marked unconfigured, notify the peripheral
         * drivers that this device is no more.
         */
        if ((path->device->flags & CAM_DEV_UNCONFIGURED) == 0)
                /* Send the async notification. */
                xpt_async(AC_LOST_DEVICE, path, NULL);

        xpt_release_ccb(done_ccb);
        break;
}

But cam_periph_error() will issue an xpt_async(AC_LOST_DEVICE,
path, NULL) regardless of whether or not the device has been seen
already (as per the comment above), i.e. on every initial bus scan,
you will get into (on an aac(4) card with LUN > 0):

cam_periph.c/cam_periph_error():
        case CAM_SEL_TIMEOUT:
        {
.
.
                /*
                 * Let peripheral drivers know that this device has gone
                 * away.
                 */
                xpt_async(AC_LOST_DEVICE, newpath, NULL);
                xpt_free_path(newpath);
                break;
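
To contrast the two notification paths, here is a small stand-alone model.
All names are hypothetical, the counter merely stands in for
xpt_async(AC_LOST_DEVICE, ...), and it assumes (as the LUN > 0 workaround
in aac(4) suggests, though I have not shown that code here) that a
CAM_DEV_NOT_THERE status does not reach the unguarded notification.

```c
#include <stdbool.h>

/* Stand-in status codes; not the real CAM values. */
enum model_probe_status {
        MODEL_PROBE_SEL_TIMEOUT,
        MODEL_PROBE_DEV_NOT_THERE
};

struct model_device {
        bool unconfigured;        /* stands in for CAM_DEV_UNCONFIGURED */
        unsigned int lost_events; /* counts modelled AC_LOST_DEVICE events */
};

/*
 * Models one failed INQUIRY during a probe:
 *  - the error-handling step notifies unconditionally on a selection
 *    timeout (assumption: not on "device not there");
 *  - the tail of the probe-done step notifies only if the device had
 *    been configured before.
 */
static void
model_failed_inquiry(struct model_device *dev, enum model_probe_status status)
{
        if (status == MODEL_PROBE_SEL_TIMEOUT)
                dev->lost_events++;     /* unguarded path */
        if (!dev->unconfigured)
                dev->lost_events++;     /* guarded path */
}
```

For a device that was never configured (unconfigured set to true), a
selection timeout still produces one modelled event, whereas a "device not
there" status produces none.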

Is this really right?  This generates a lot of interrupt noise when no
devices are attached during the initial scan, i.e. we are treating
failed INQUIRY commands during the initial scan of the SCSI bus as if
we had really lost a device to a selection timeout (we even generate a
path to issue the async event).

I should have titled the thread a little better, but basically we
always generate a ton of software CAM interrupts during a LUN scan for
targets on aac(4) that do not really exist (i.e. nothing is truly
there).  We do this because we treat the failure of the initial INQUIRY
as a selection timeout rather than as a sign that the device is simply
not there.  There seems to be a historical workaround for part of this
issue, but I am trying to delve deeper in order to do the *right
thing* for our 6.1 deployments (as well as 7.x and CURRENT).

-aps
_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current