Tru64 5.1A / HSG80 RAID disk problems

From: Dirk Kleinhesselink (dkleinh_at_phy.ucsf.edu)
Date: 11/26/03

    Date: Wed, 26 Nov 2003 09:51:11 -0800 (PST)
    To: tru64-unix-managers@ornl.gov
    
    

    I have a 2-member Tru64 5.1A cluster (DS10s) connected through KGPSA
    Fibre Channel HBAs to an HSG80 RAID controller with several RAID sets.
    Last Friday my system locked up hard for a long time; NFS clients of the
    cluster all got a lot of "NFS server not responding" / "NFS server OK"
    messages. We rebooted the cluster and things seemed better, but on Monday
    there were more server problems, and we noticed that filesystems on one
    of the raidsets seemed to really hang when we tried to access them
    directly on the cluster (i.e. not over NFS). Yesterday the system again
    locked up hard, and we could not even reboot the cluster without
    resetting the HSG80. When the system came up, I opened a console on the
    HSG80 and saw spurts of error messages referring to 2 disks of the
    raidset that was hanging. One of the disks seemed to have more errors,
    but it's hard to tell because you need to capture the rapidly spooling
    output. I called HP and we paid (no maintenance contract) to get service
    techs out with 2 replacement disks, and we were able to fail out (reduce)
    the two disks and reconstruct the raidset one disk at a time. The tech
    started the process with the first disk, and I did the 2nd disk after
    the first reconstruction had finished. While I was replacing the 2nd
    disk, I saw another, similar error message reported from one of the
    disks on the HSG80 console. This morning I opened a console on the HSG80
    again and at one point got another spurt of messages from a few more
    disks. The error messages look like:
    %EVL--HSG80> --14-JAN-1946 05:03:10 (time not set)-- Instance Code: 0258000A
     Template: 81.(51)
     Power On Time: 2. Years, 140. Days, 10. Hours, 30. Minutes, 44. Seconds
     Controller Model: HSG80
     Serial Number: ZG11304588 Hardware Version: E12(2A)
     Software Version: V85F-0(55)
     Informational Report
     Unit Number: 21.(0015)
     Unit Software Version: 1.(01) Unit Hardware Version: 55.(37)
     Retry Level: 1. Retries: 1.
     Port: 4. Target: 1. LUN: 0.
     SCSI Device Type: 0.(00)
     Device ID: "BD036635C5" Device Serial Number: " 0108"
     Device Software Revision Level: "B017"
     SCSI Command Opcode: 40.(28)
     Sense Data Qualifiers: 64.(40)
     SCSI Sense Data:
      Error Code: 112.(70) {current command execution}
      Information field is valid
      Segment: 0.(00)
      Sense Key: 11.(0B) ABORTED COMMAND
      ILI: 0 EOM: 0 FM: 0
      Information: 3086EA08
      Additional Sense Length: 10.(0A)
      Command-Specific Information: 00000000
      ASC: 0.(00) ASCQ: 6.(06)
      FRU: 0.(00) Sense-Key Specific: 000000
     Instance Code: 0258000A
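
Because the events scroll past quickly, one option is to log the console session to a file and count the events per device afterwards. The following is a minimal Python sketch, assuming every event in the capture is laid out like the one above (a line containing "%EVL--", then "Port: / Target: / LUN:" and "Sense Key:" lines). The function name tally_events and the shortened sample capture are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Assumption: each event starts with a line containing "%EVL--",
# as in the sample event shown above.
EVENT_START = re.compile(r"%EVL--")
# Matches e.g. "Port: 4. Target: 1. LUN: 0."
DEVICE = re.compile(r"Port:\s*(\d+)\.\s*Target:\s*(\d+)\.\s*LUN:\s*(\d+)\.")
# Matches e.g. "Sense Key: 11.(0B) ABORTED COMMAND" and keeps the text part.
SENSE_KEY = re.compile(r"Sense Key:\s*\d+\.\([0-9A-Fa-f]{2}\)\s*(.*)")


def tally_events(text):
    """Count captured events per (port, target, lun, sense key text)."""
    counts = Counter()
    # Everything before the first "%EVL--" is not an event, so drop it.
    for chunk in EVENT_START.split(text)[1:]:
        dev = DEVICE.search(chunk)
        if dev is None:
            continue  # event without device information; skip it
        port, target, lun = (int(g) for g in dev.groups())
        key = SENSE_KEY.search(chunk)
        sense = key.group(1).strip() if key else "unknown"
        counts[(port, target, lun, sense)] += 1
    return counts


# Hypothetical, shortened capture using only fields from the event above.
capture = """\
%EVL--HSG80> --14-JAN-1946 05:03:10 (time not set)-- Instance Code: 0258000A
 Port: 4. Target: 1. LUN: 0.
  Sense Key: 11.(0B) ABORTED COMMAND
"""

for (port, target, lun, sense), n in tally_events(capture).most_common():
    print(f"port {port} target {target} lun {lun}: {n} x {sense}")
```

Sorting by count makes it easier to see whether the errors cluster on one or two disks or are spread across many ports and targets.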

    Does anyone know whether this means my disks are all failing, or my
    controller is failing, or something else? I have had disks get marked as
    failed before and replaced them, but I have never seen this. I also
    haven't generally sat on the HSG80 console for long periods, so I don't
    know whether some error messages are normal.

    Thanks for any insight.

    Dirk

