Need help diagnosing hardware failure

From: Doug Poland (doug_at_polands.org)
Date: 05/10/04

  • Next message: Bill Campbell: "Re: Help: Tip on Buying External modem"
    Date: Mon, 10 May 2004 10:52:19 -0500
    To: questions@freebsd.org
    
    

    Hello,

    Upon returning from a week's vacation, I was dismayed to find that my home
    file server (running 4.8-STABLE) had crashed. The box in question has
    an Adaptec host adapter:

    ahc0: <Adaptec 2940A Ultra SCSI adapter> port 0xf800-0xf8ff mem 0xfedfe000-0xfedfefff irq 10 at device 13.0 on pci0
    aic7860: Ultra Single Channel A, SCSI Id=7, 3/253 SCBs

    and seven identical SCSI drives

    judeah# dmesg | grep IBMRAID
    da0: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da2: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da3: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da4: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da6: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    da5: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device

    in a vinum striped volume...

    judeah# more /etc/vinum.conf
    drive a device /dev/da0e
    drive b device /dev/da1e
    drive c device /dev/da2e
    drive d device /dev/da3e
    drive e device /dev/da4e
    drive f device /dev/da5e
    drive g device /dev/da6e

    volume dataraid
      plex org striped 256k
          sd length 1920m drive a
          sd length 1920m drive b
          sd length 1920m drive c
          sd length 1920m drive d
          sd length 1920m drive e
          sd length 1920m drive f
          sd length 1920m drive g
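
For reference, the geometry implied by this configuration can be worked out with a few lines of Python. This is only a sketch: it assumes the "m" and "k" suffixes mean MiB and KiB and that sectors are 512 bytes, and the function name plex_geometry is made up for illustration.

```python
# Values copied from the vinum.conf above. Reading "m" as MiB and "k" as KiB
# is an assumption, as is the 512-byte sector size.
SECTOR_BYTES = 512
SUBDISK_MIB = 1920
STRIPE_KIB = 256
SUBDISKS = 7


def plex_geometry(subdisk_mib, stripe_kib, subdisks):
    """Return sector counts and total size for a striped plex."""
    subdisk_sectors = subdisk_mib * 1024 * 1024 // SECTOR_BYTES
    stripe_sectors = stripe_kib * 1024 // SECTOR_BYTES
    # A striped plex wants each subdisk to hold a whole number of stripes.
    stripes_per_subdisk, leftover = divmod(subdisk_sectors, stripe_sectors)
    return {
        "subdisk_sectors": subdisk_sectors,
        "stripe_sectors": stripe_sectors,
        "stripes_per_subdisk": stripes_per_subdisk,
        "leftover_sectors": leftover,
        "volume_mib": subdisk_mib * subdisks,
    }


geometry = plex_geometry(SUBDISK_MIB, STRIPE_KIB, SUBDISKS)
print(geometry)
```

Under those assumptions each subdisk holds a whole number of 512-sector stripes, and the volume comes to 13440 MiB.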

    Perusal of /var/log/messages shows...

    May 3 11:17:31 judeah /kernel: (da1:ahc0:0:1:0): SCB 0x5a - timed out
    May 3 11:17:31 judeah /kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
    May 3 11:17:31 judeah /kernel: ahc0: Dumping Card State while idle, at SEQADDR 0x7
    May 3 11:17:31 judeah /kernel: Card was paused
    May 3 11:17:31 judeah /kernel: ACCUM = 0x97, SINDEX = 0x52, DINDEX = 0x8c, ARG_2 = 0x0
    May 3 11:17:31 judeah /kernel: HCNT = 0x0 SCBPTR = 0x1
    May 3 11:17:31 judeah /kernel: SCSISIGI[0x0] ERROR[0x40] SCSIBUSL[0x0] LASTPHASE[0x1]
    May 3 11:17:31 judeah /kernel: SCSISEQ[0x12] SBLKCTL[0x0] SCSIRATE[0x0] SEQCTL[0x10]
    May 3 11:17:31 judeah /kernel: SEQ_FLAGS[0xc0] SSTAT0[0x5] SSTAT1[0xa] SSTAT2[0x0]
    May 3 11:17:31 judeah /kernel: SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4] SXFRCTL0[0x80]
    May 3 11:17:31 judeah /kernel: DFCNTRL[0x0] DFSTATUS[0x29]
    May 3 11:17:31 judeah /kernel: STACK: 0x0 0x166 0x109 0x3
    May 3 11:17:31 judeah /kernel: SCB count = 130
    May 3 11:17:31 judeah /kernel: Kernel NEXTQSCB = 30
    May 3 11:17:31 judeah /kernel: Card NEXTQSCB = 30
    May 3 11:17:31 judeah /kernel: QINFIFO entries:
    May 3 11:17:31 judeah /kernel: Waiting Queue entries:
    May 3 11:17:31 judeah /kernel: Disconnected Queue entries: 2:90
    May 3 11:17:31 judeah /kernel: QOUTFIFO entries:
    May 3 11:17:31 judeah /kernel: Sequencer Free SCB List: 1 0
    May 3 11:17:31 judeah /kernel: Sequencer SCB Info:
    May 3 11:17:31 judeah /kernel: 0 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff]
    May 3 11:17:31 judeah /kernel: 1 SCB_CONTROL[0xe2] SCB_SCSIID[0x67] SCB_LUN[0x0] SCB_TAG[0xff]
    May 3 11:17:31 judeah /kernel: 2 SCB_CONTROL[0x66] SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0x5a]
    May 3 11:17:31 judeah /kernel: Pending list:
    May 3 11:17:31 judeah /kernel: 90 SCB_CONTROL[0x62] SCB_SCSIID[0x17] SCB_LUN[0x0]
    May 3 11:17:31 judeah /kernel: Kernel Free SCB list: 82 88 14 115 12 83 120 92 45 8 16 5 59 124 31 29 38 18 73 42 93 64 19 7 74 100 113 75 24 3 86 71 20 108 6 67 68 125 105 97 110 34 54 87 106 25 61 109 123 47 44 66 53 94 84 76 65 77 72 9 69 32 17 55 119 1 22 91 4 112 56 27 102 62 13 15 128 50 33 51 81 37 57 28 99 117 85 36 41 11 121 49 0 80 35 39 40 95 26 96 10 58 118 122 127 111 2 126 70 98 89 21 60 46 48 78 43 101 23 79 52 63 129 103 104 107 116 114
    May 3 11:17:31 judeah /kernel:
    May 3 11:17:31 judeah /kernel: <<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
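
To tie the dump back to a physical disk, the device path and the SCB_SCSIID values can be decoded as sketched below. Two assumptions are baked in: that the parenthesised path reads device:controller:bus:target:lun, and that SCB_SCSIID carries the target ID in its high nibble and the adapter's own ID in its low nibble. The second is my reading of the aic7xxx register layout, not something the log states. The helper names are hypothetical.

```python
import re

# Assumed layout of the path in the kernel message:
# (device:controller:bus:target:lun)
PATH_RE = re.compile(r"\((\w+):(\w+):(\d+):(\d+):(\d+)\)")


def parse_cam_path(line):
    """Extract device, controller, bus, target and lun from a log line."""
    match = PATH_RE.search(line)
    if match is None:
        return None
    device, controller, bus, target, lun = match.groups()
    return {
        "device": device,
        "controller": controller,
        "bus": int(bus),
        "target": int(target),
        "lun": int(lun),
    }


def decode_scsiid(value):
    """Split SCB_SCSIID into target (high nibble) and initiator (low nibble).

    The nibble assignment is an assumption about the aic7xxx register layout.
    """
    return {"target": (value >> 4) & 0xF, "initiator": value & 0xF}


line = "May 3 11:17:31 judeah /kernel: (da1:ahc0:0:1:0): SCB 0x5a - timed out"
print(parse_cam_path(line))
print(decode_scsiid(0x17))
```

If that reading is right, the timed-out SCB 0x5a and pending SCB 90 (both SCB_SCSIID 0x17) are addressed to target 1, which is da1, from initiator 7, which matches the "SCSI Id=7" the adapter reports.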

    The box rebooted and failed to come up to its normal state because the
    vinum volume that was running off this SCSI disk system failed to load.

    May 3 11:22:01 judeah /kernel: sg[0] - Addr 0x1ddd000 : Length 4096
    May 3 11:22:01 judeah /kernel: sg[1] - Addr 0x7be000 : Length 4096
    May 3 11:22:01 judeah /kernel: (da1:ahc0:0:1:0): no longer in timeout, status = 34b
    May 3 11:22:01 judeah /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    May 3 11:22:01 judeah /kernel: vinum: dataraid.p0.s1 is stale by force
    May 3 11:22:01 judeah /kernel: vinum: dataraid.p0 is corrupt
    May 3 11:22:01 judeah /kernel: fatal :dataraid.p0.s1 write error, block 1905465 for 8192 bytes
    May 3 11:22:01 judeah /kernel: dataraid.p0.s1: user buffer block 13336624 for 8192 bytes

    It looks like SCSI disk da1 was timing out but recovered. This is
    speculation on my part. Upon rebooting today, da1 seems to be OK?

    May 10 07:03:00 judeah /kernel: da1 at ahc0 bus 0 target 1 lun 0
    May 10 07:03:00 judeah /kernel: da1: <IBMRAID 0664M1H9337 5 58> Fixed Direct Access SCSI-2 device
    May 10 07:03:00 judeah /kernel: da1: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
    May 10 07:03:00 judeah /kernel: da1: 1920MB (3933040 512 byte sectors: 255H 63S/T 244C)

    So, the question: do I have a hardware failure? If so, is it the
    Adaptec 2940A Ultra controller or the SCSI disk? When I get this resolved,
    I'll obviously have to figure out how to fix my corrupt vinum volume :(

    -- 
    Regards,
    Doug
    
