Tru64 Unix cluster goes belly-up - trying to figure out why!

From: Chris Knorr (cknorr_at_trapsystems.com)
Date: 11/16/04

    Date: Tue, 16 Nov 2004 11:12:18 -0500
    To: tru64-unix-managers@ornl.gov
    
    

    If anyone has any thoughts or ideas on this I’d be extremely grateful. I
    justified the purchase of a high availability cluster to my boss and I’m not
    looking particularly good at the moment. :^(
     
    We have a fairly new (3 months old) 2-node ES40 cluster; we are not using a
    hub, just a direct connection between the two Memory Channel cards. Both
    nodes run V5.1B (Rev. 2650) and have HBA cards configured for multi-pathing,
    connecting to our StorageWorks SAN (HSG80s).
     
    When we came in this morning we had a crowd of users saying they could not
    connect to the cluster. We noticed immediately that we could not connect to
    either machine from the KVM console. The screen was not at a blue screen –
    just totally unresponsive. The LED display on the front of both machines
    showed the machine names that we’d set from the chevron prompt, telling us
    at least there was power to the boxes. From a remote machine we were able to
    ping one of the machines (“wasp”) but not the other (“hornet”). However we
    could not telnet to wasp. Effectively, we were completely dead in the water.
     
    We powered off both machines and tried booting “hornet”. It immediately
    complained about the HBA card it was trying to boot off. At this point we
    replaced this card with a spare and the machine booted fine. We then booted
    the second node (“wasp”) and it also booted fine. I suspect we might have
    succeeded if we had tried booting off the 2nd HBA card on “hornet”,
    but we never tried that.
     
    My basic questions are:
     
    • It seems like we had a hardware problem on hornet, but wasp was still
    “ping-able”. Why couldn’t we telnet to it?
    • Given HBA cards configured for multi-pathing, why would the failure of one
    HBA card cause the machine to go down, or not be responsive?
     
    Just to add to the mystery, we have no crash dump created, there are no
    relevant errors reported in the error log, and no relevant messages reported
    in the messages file. The errors shown on the console before we replaced the
    HBA card were:
     
    Initializing pkb pka dqa dqb eia eib pgb pga
    Pga HARD restarted failed
    NVRAM format incorrect
     
    After this, the entire display filled with numbers, and then
     
    ELS stalled for too long pga 0.0.0.2.1
    Pga port initialization failed
    Ega
     

