SUMMARY / UPDATE : HSZ 40 / DEC Alpha Cluster / Problem after power failure

From: Christian Wessely (christian.wessely_at_uni-graz.at)
Date: 12/20/04

    Date: Mon, 20 Dec 2004 08:19:43 +0100
    To: tru64-unix-managers@ornl.gov
    
    

    Since several users requested it, here is an updated summary
    containing the complete procedure:

    a) Problem:
    A power failure lasted 30 minutes, longer than the connected UPS
    could sustain. After 20 minutes, the UPS software initiated a
    shutdown; unfortunately, the shutdown did not complete, and at that
    very moment the routine mirroring of the main and backup raidsets was
    running. We ended up with a server that came up without a problem but
    was unable to find the external raidshelf (SW300 containing 2x HSZ40
    dual redundant and 3 raidsets with 6 disks and 6 hot spares, each
    raidset one unit: main D100, mirror data D200, mirror web D300). The
    HSZ lights indicated operative condition: channel LEDs off, reset
    light blinking.

    b) Diagnosis:
    Tried to mount the main unit manually: failed. Checked /etc/fdmns:
    domains missing. Checked the /dev/rrz17 to /dev/rrz19 device files:
    ok. Tried to connect to the HSZ using hszterm -f /dev/rrz17g: failed.

    Connected a notebook to the serial port of the HSZ40.
    SHOW THIS revealed:
    This controller has an invalid cache module
    Controllers misconfigured. Type SHOW THIS_CONTROLLER
    Power Supply failure cleared.
    Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
    Please - see user guide to determine corrective action

    SHOW OTHER showed ok.

    The user guide (order no. EK-HSFAM-SV.D01, firmware rev. 2.5) suggests:

    CLEAR_ERRORS INVALID_CACHE controller

    Tried this, but in vain. Desperation. UARRRRGH!
    Switching to offsite mirror, posting call for assistance to
    tru64-unix-managers@ornl.gov, hopping around madly, lighting a candle,
    praying.
    Answer by Phil Baldwin showed that the syntax suggested by the user
    guide was simply wrong. The correct syntax was:

    CLEAR_ERRORS controller INVALID_CACHE [destroy_unflushed_data] or
    [nodestroy_unflushed_data]

    Applying this - ok.
    Connected the notebook to the defective controller (!!!), issued SET
    THIS NOFAILOVER and afterwards SET FAILOVER COPY=OTHER. (Dangerous:
    don't confuse the controllers here. COPY= names the SOURCE of the
    configuration!!!)
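
    The command order above can be sketched as a small Python helper
    that only builds the CLI strings; it does not talk to a controller.
    The function names are hypothetical, and the accepted controller
    keywords (THIS_CONTROLLER, OTHER_CONTROLLER) are an assumption based
    on the CLI messages quoted in this post. The post does not say which
    of the two data-retention switches was used, so the caller has to
    choose one explicitly.

```python
# Sketch only: builds the HSZ CLI strings described in this post.
# Function names are hypothetical; nothing is sent to any controller.

CONTROLLERS = ("THIS_CONTROLLER", "OTHER_CONTROLLER")  # assumed keywords
RETENTION = ("DESTROY_UNFLUSHED_DATA", "NODESTROY_UNFLUSHED_DATA")


def clear_invalid_cache(controller, retention):
    """Return the command in the corrected order:
    CLEAR_ERRORS <controller> INVALID_CACHE <retention switch>."""
    controller = controller.upper()
    retention = retention.upper()
    if controller not in CONTROLLERS:
        raise ValueError("unknown controller keyword: " + controller)
    if retention not in RETENTION:
        raise ValueError("unknown retention switch: " + retention)
    return f"CLEAR_ERRORS {controller} INVALID_CACHE {retention}"


def recovery_sequence(retention):
    """Commands in the order given in the post, all typed on the
    console of the defective controller."""
    return [
        clear_invalid_cache("THIS_CONTROLLER", retention),
        "SET THIS NOFAILOVER",
        # COPY= names the source of the configuration: OTHER is the
        # healthy controller when connected to the defective one.
        "SET FAILOVER COPY=OTHER",
    ]


if __name__ == "__main__":
    for command in recovery_sequence("NODESTROY_UNFLUSHED_DATA"):
        print(command)
```

    Making the retention switch a required argument is deliberate: one
    switch discards unflushed cache data, so it should never be picked
    by default.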

    ok, controllers back online.
    Show raid full: ok.
    Show units full:
        LUN Uses
    --------------------------------------------------------------
       D100 R1
             Switches:
               RUN NOWRITE_PROTECT READ_CACHE
               WRITEBACK_CACHE
               MAXIMUM_CACHED_TRANSFER_SIZE = 32
             State:
               INOPERATIVE
               Unit has lost data
               PREFERRED_PATH = THIS_CONTROLLER
               WRITE_PROTECT - DATA SAFETY
             Size: 41879900 blocks
       D200 R2
             Switches:
               RUN NOWRITE_PROTECT READ_CACHE
               WRITEBACK_CACHE
               MAXIMUM_CACHED_TRANSFER_SIZE = 32
             State:
               INOPERATIVE
               Unit has lost data
               PREFERRED_PATH = THIS_CONTROLLER
               WRITE_PROTECT - DATA SAFETY
             Size: 20539825 blocks
       D300 R3
             Switches:
               RUN NOWRITE_PROTECT READ_CACHE
               WRITEBACK_CACHE
               MAXIMUM_CACHED_TRANSFER_SIZE = 32
             State:
               INOPERATIVE
               Unit has lost data
               PREFERRED_PATH = THIS_CONTROLLER
               WRITE_PROTECT - DATA SAFETY
             Size: 20539825 blocks
    Cache battery charge is low
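
    With more than a handful of units, a listing like the one above is
    easier to check with a script. The following is a minimal Python
    sketch, assuming the SHOW UNITS FULL output was captured to text
    from the serial console and keeps the layout shown here (a unit
    line such as "D100 R1" followed by indented detail lines). The
    function name and the abridged sample are illustrative only.

```python
import re

# Abridged copy of the listing in this post (Switches sections omitted).
SAMPLE = """\
    LUN Uses
--------------------------------------------------------------
   D100 R1
         State:
           INOPERATIVE
           Unit has lost data
         Size: 41879900 blocks
   D200 R2
         State:
           INOPERATIVE
           Unit has lost data
         Size: 20539825 blocks
"""

# A unit line is assumed to be "<unit name> <container name>", e.g. "D100 R1".
UNIT_RE = re.compile(r"^\s*(D\d+)\s+(\S+)\s*$")


def units_with_lost_data(text):
    """Return the names of units whose detail lines include
    'Unit has lost data', in order of appearance."""
    flagged = []
    current = None
    for line in text.splitlines():
        match = UNIT_RE.match(line)
        if match:
            current = match.group(1)  # start of a new unit block
        elif current and "Unit has lost data" in line:
            if current not in flagged:
                flagged.append(current)
    return flagged


if __name__ == "__main__":
    print(units_with_lost_data(SAMPLE))
```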

    OK, have to bring the units to operative state again.
    Solution:
    CLEAR_ERRORS LOST_DATA unit-number

    brought them back to operative state.
    All data and all sets ok. No further problems.

    I still have to figure out the problem with the powerfail shutdown
    script - I would expect the system to come back up in a stable
    condition after a shutdown initiated by xpowerchute.

    Thanks to all who replied and helped!
    CW

