HSZ 40 / DEC Alpha Cluster / Problem after power failure

From: Christian Wessely (christian.wessely_at_uni-graz.at)
Date: 12/17/04

  • Next message: Christian Wessely: "SUMMARY: HSZ 40 / DEC Alpha Cluster / Problem after power failure"
    Date: Fri, 17 Dec 2004 08:11:28 +0100
    To: tru64-unix-managers@ornl.gov
    
    

    The complete output of SHOW THIS FULL on the HSZ40 is at the bottom of
    this message.

    Hello Admin wizards,

    After almost two years of trouble-free operation we suffered a power
    failure yesterday morning.
    Even though we have a UPS connected, the system did not shut down
    cleanly, as it was supposed to after the 15-minute period. So I now
    have the following problem:

    The server itself boots up normally, but does not find the connected
    HSZ40 (dual-redundant) controllers, or more precisely the raidsets
    defined on them, formerly known as HSZ40#raid -> /dev/rrz17g and so on.
    I already found that the symlinks in /etc/fdmns are missing; I
    recreated them by unpacking the fdmns directory from a tar file, but
    without success. Am I supposed to run MAKEDEV in /dev/ again?
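
One way to tell whether device special files are what is missing is to check where the restored links in /etc/fdmns actually point. The following is only a minimal sketch, assuming a Python interpreter is available on the host; `find_dangling_links` is a hypothetical helper name, not a Tru64 tool. It lists every symlink under a directory whose target does not exist.

```python
import os

def find_dangling_links(fdmns_root="/etc/fdmns"):
    """Return sorted (link_path, target) pairs for symlinks whose target is missing."""
    dangling = []
    # os.walk does not follow symlinks; broken links show up in filenames.
    for dirpath, dirnames, filenames in os.walk(fdmns_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself, exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append((path, os.readlink(path)))
    return sorted(dangling)

if __name__ == "__main__":
    for link, target in find_dangling_links():
        print(f"{link} -> {target} (target missing)")
```

If every link resolves, the device nodes are present and the problem more likely lies with the controllers than with /dev.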

    When I connect the laptop to the controllers, one of them says
    --------------------------------------------------
    This controller has an invalid cache module
    Controllers misconfigured. Type SHOW THIS_CONTROLLER
    Power Supply failure cleared.
    Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
    Please see user guide to determine corrective action
    --------------------------------------------------
    The other one seems to be OK, and it shows the raidsets and units as it
    is supposed to.
    The user guide suggests using the command
    CLEAR_ERRORS INVALID_CACHE THIS_CONTROLLER

    but the device responds with
    HSZ_UNTEN > CLEAR_ERRO
    Incomplete command

    It won't accept complete commands, but cuts them off after an
    unpredictable length. The same happens if I try to issue
    SET FAILOVER COPY=OTHER: it breaks off after SET FAILO and complains
    about an incomplete command ...

    I wonder why this happens, and how I can get out of this mess.

    I would be grateful for any hint.

    Regards,
    CW

    --------SHOW THIS FULL----------------------------------
    HSZ_UNTEN > show this full
    %CER--HSZ_UNTEN > --13-JAN-1946 04:33:29 (time not set)-- Invalid cache -- CLI command set reduced.
    Type SHOW THIS_CONTROLLER. Please see user guide to determine corrective action
    HSZ_UN
    Controller:
             HSZ40 ZG62003200 Firmware V31Z-4, Hardware A01
             Configured for dual-redundancy with ZG65008815
                 Controllers misconfigured -- other controller not in failover, a
                 SET FAILOVER COPY= is required to re-synchronize controllers
             SCSI address 7
             Time: NOT SET
    Host port:
             SCSI target(s) (1, 2, 3, 4), Preferred target(s) (1, 2, 3, 4)
             TRANSFER_RATE_REQUESTED = 10MHZ
    Cache:
             32 megabyte write cache, version 2
             Cache is INVALID. Cache containing unflushed data
              has been removed from this controller
             Unknown unflushed data in cache
             CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
             CACHE_UPS
             Host Functionality Mode = A
    Licensing information:
             RAID (RAID Option) is ENABLED, license key is VALID
             WBCA (Writeback Cache Option) is ENABLED, license key is VALID
             MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
    Extended information:
             Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
             Operation control: 00000000 Security state code: 16536
             Configuration backup enabled on 12 devices
    This controller has an invalid cache module
    Controllers misconfigured. Type SHOW THIS_CONTROLLER
    Invalid cache -- CLI command set reduced. Type SHOW THIS_CONTROLLER.
    Please see user guide to determine corrective action
    HSZ_UNTEN >

