TruCluster v5.1A server boot PANIC problem

From: Gergen, Peter (petergergen@kpmg.com.au)
Date: 04/17/03

    Date: Thu, 17 Apr 2003 22:37:32 +1000
    From: "Gergen, Peter" <petergergen@kpmg.com.au>
    To: tru64-unix-managers@ornl.gov
    
    

    Hi Fellow Managers

    I have created a cluster environment so that I can test patch kit upgrades
    before applying them to production servers.
    While setting up my test cluster, I ran into a problem that I had not seen
    before.
    The cluster consists of two 500au Personal Workstations with KZPSA
    controllers connected to an SWXRC-04 (HSZ40 equivalent).
    I got a single-node cluster running on v5.1A with PK2. I then added the
    second member's boot disk with clu_add_member and booted that second member
    while the first was running. This worked, and I was able to watch the second
    node being configured and updated. The network configuration failed, so I
    completed it manually (not without problems), after which the second node
    was up.
    I could switch this node to run level S and back to run level 3, or to any
    other run level and back, as long as the server was not rebooted, or shut
    down and then restarted. If the server was shut down or rebooted, the
    following message appeared:

        CNX MGR: Invalid configuration for cluster seq disk

    and the server would panic and return to the SRM prompt.
    There seems to be a fix for this in v5.1B, but I can find no reference to it
    for v5.1A (see the documentation below).
    Any assistance in solving this problem would be appreciated.

    ******************************************************************************************

    I did some digging and came up with this, but it is for v5.1B:
    From:
    http://ftp1.support.compaq.com/public/unix/v5.1b/TruCluster_V5.1B/doc/txt/TCRPAT00005000540.txt
    PROBLEM: (93677, 92409, 94911, 92799) (PATCH ID: TCR540-012)
    ********
    PROBLEM: (93677) (PATCH ID: )
    This patch improves the responsiveness of EINPROGRESS handling during the
    issuing of I/O barriers. The fix removes a possible infinite loop scenario
    which could occur due to the deletion of a storage device.
    The issue with EINPROGRESS responsiveness is the continued looping while
    waiting for a disk structure to become available. No attempts were being
    made to force the availability of the disk structure.
    In addition, no retry limit was being enforced and no checks were being made
    for deleted devices. This combination presents the possibility of infinite
    retry attempts.

    PROBLEM: (92409) (PATCH ID: )
    This patch fixes a CNX manager panic encountered while multiple cluster
    nodes are booted simultaneously.
    The panic string seen is: CNX MGR: Invalid configuration for cluster seq disk

    From: Patch Summary and Release Notes for Patch Kit 1 for v5.1B
    This manual contains information specific to Patch Kit 1 of the Tru64 UNIX
    operating system and TruCluster Server software products for Version 5.1B.

    Number: Patch 50.00
    Abstract: Fixes a regression associated with non-SCSI storage
    State: Supersedes Patch 27.00, 28.00, 29.00, 31.00
    This patch:
    * Fixes a regression associated with non SCSI storage.
    * Improves the responsiveness of EINPROGRESS handling during the
    issuing of I/O barriers by removing a possible infinite loop scenario that
    could occur due to the deletion of a storage device.
    * Fixes a problem that causes a panic with the message "CNX MGR:
    Invalid configuration for cluster seq disk" during simultaneous booting of
    cluster nodes.
    * Fixes a possible race condition between a SCSI reservation conflict
    and an I/O drain, which could result in a hang.
    * Alleviates a condition in which a cluster member takes an extremely
    long time to boot when using LSM.
    * Fixes a problem in the cluster kernel where a cluster member panics
    while doing remote I/O over the interconnect.
    * Corrects an issue to allow the Device Request Dispatcher, DRD, to
    retry to get disk attributes when EINPROGRESS is returned from the disk
    driver.
    * Fixes a problem in which access to the quorum disk can be lost if
    the quorum disk is on a parallel SCSI bus and multiple bus resets are
    encountered.
    ******************************************************************************************
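The notes for problem 93677 above describe the underlying bug as a loop that kept retrying on EINPROGRESS with no retry limit and no check for deleted devices. A minimal C sketch of the general pattern the fix describes (a bounded retry that also gives up when the device has gone away) might look like the following. This is not the TruCluster DRD code: `struct disk`, `get_disk_attrs()` and `get_disk_attrs_bounded()` are hypothetical names, the "busy" state is simulated with a counter, and a real driver would also wait between attempts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical disk handle; NOT the real TruCluster/DRD structure. */
struct disk {
    bool deleted;         /* set when the storage device has been removed */
    int  busy_polls_left; /* simulated: polls before the structure becomes available */
};

/* Hypothetical attribute query: reports EINPROGRESS while the disk structure is busy. */
static int get_disk_attrs(struct disk *d)
{
    if (d->busy_polls_left > 0) {
        d->busy_polls_left--;
        return EINPROGRESS;
    }
    return 0;
}

/*
 * Retry on EINPROGRESS, but only up to max_retries attempts, and stop early
 * if the device has been deleted, so the loop can never spin forever.
 */
static int get_disk_attrs_bounded(struct disk *d, int max_retries)
{
    assert(max_retries > 0);
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (d->deleted)
            return ENODEV;   /* device is gone: retrying cannot succeed */
        int rc = get_disk_attrs(d);
        if (rc != EINPROGRESS)
            return rc;       /* success or a hard error */
    }
    return ETIMEDOUT;        /* retry limit reached */
}
```

Returning ETIMEDOUT or ENODEV lets the caller fail the request cleanly instead of hanging. The notes' other point, actively forcing the disk structure to become available, is not modelled in this sketch.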

    Regards

    Peter Gergen
    Nexus Tru64/HP-UX/Win2K/Oracle System Administrator
    Tel: (03) 9288 6236 / 0418 475 575
    petergergen@kpmg.com.au
    kpmg Melbourne Australia


