changed WWID on cluster member boot disk

From: Bill Bennett (BENNETT_at_MPGARS.DESY.DE)
Date: 08/19/04

    Date: Thu, 19 Aug 2004 22:02:37 +0200
    To: tru64-unix-managers@ornl.gov
    
    

    Hello Managers,

    I have a DS20E that was recently upgraded to 5.1B PK3 and then made a
    single-member cluster; the second member has not yet been added to
    the cluster. The disks containing the cluster root, usr and var
    file systems, member boot disks, quorum disk and a few user disks
    are actually all partitions (seen as LUNs of one SCSI ID by the
    DS20E) on a single RAID set on a third-party hardware RAID system
    (CMD CRD-5500); at the moment, a KZPBA-CB controller in the DS20E
    and the CRD-5500 are the only devices on what will eventually be
    the shared cluster SCSI bus.

    Today we had a short power outage in the computer room; unfortunately,
    one of the things I hadn't gotten to yet was to put the RAID system
    on a UPS, so it lost power while the DS20E stayed up. Although the
    RAID set per se came up undegraded when power was restored, the DS20E
    was hung when I found it. On resetting the DS20E, I could see at the
    SRM console that it could still find all the LUNs from the RAID system,
    but an attempt to boot the DS20E as a single-member cluster failed;
    early in the boot output I saw the line:

      drd_config_thread: 5 previously unknown devices

    and later the cluster reboot got stuck, repeatedly printing the line:

      waiting for cluster member boot disk to become registered

    I was able to boot from the stand-alone (pre-cluster) system disk; during
    the boot of the stand-alone system, a number of new special device files
    were created, and after the system was up, I could see that the problem
    is that the WWIDs for 5 of the 6 LUNs of the RAID system had changed
    somehow ... actually, in each case, four digits of a 32-digit hex string
    in the WWID were changed, although the WWIDs remain unique (or at least
    different from one another). The LUN of the RAID set whose WWID did not
    change was LUN 0, which contained the cluster root, usr and var file
    domains, but the WWIDs of the LUNs containing the member boot disks,
    quorum disk and user disks did change.
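
    To pin down exactly which digits differ between an old and a new
    WWID, a small comparison along the following lines could be used.
    This is only a sketch: the two hex strings are made-up placeholders,
    not the values from this system, and the helper name is hypothetical.

```python
def changed_positions(old_wwid, new_wwid):
    """Return the 0-based positions at which two equal-length hex strings differ."""
    old, new = old_wwid.lower(), new_wwid.lower()
    if len(old) != len(new):
        raise ValueError("WWIDs must have the same length to compare digit by digit")
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Hypothetical 32-digit hex strings, for illustration only.
OLD = "0123456789abcdef0123456789abcdef"
NEW = "0123456789abcdef0123ffff89abcdef"

positions = changed_positions(OLD, NEW)
print(f"{len(positions)} digits differ, at positions {positions}")
```

    Running the same comparison over all five changed LUNs would show
    whether the same four digit positions are affected in each case.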

    I have no idea what caused that, and since the CRD-5500 is clearly
    not a DEC/Compaq/HP device, this is probably not the place to find
    out ... but if anyone has any insight as to what might cause that or
    how it can be avoided in the future, I would be happy to hear it...

    But my more immediate problem is how to recover from this situation in
    which the first cluster member can't find its boot disk because the
    WWID of the disk has changed. I can in principle access all the disks
    now after booting from the stand-alone system, but I haven't yet ruled
    out the possibility that some of the AdvFS domains were corrupted when
    the RAID system lost power.

    I can imagine, perhaps naively, three ways that it might be possible
    to recover from this problem, so as I sit down to look at the hardware
    management documentation in more detail, I thought it would be a good
    time to ask for pointers ... perhaps someone can at least help me rule
    out the bad ideas sooner rather than later.

    Given that on booting the stand-alone system, the RAID LUNs with new
    WWIDs were assigned new HWIDs and device names (they were dsk3-dsk7
    but are now dsk14-dsk18), it seemed to me that it should be possible
    to do the following:

     1) use the 'hwmgr -delete component -id oldHWID' and 'dsfmgr -m
    newdev olddev' commands to restore the wayward LUNs to their previous
    device names; this would presumably update the hardware database files
    on the stand-alone system to account for the new WWIDs of these LUNs.
    Then after verifying the file domains on the RAID LUNs (and where needed
    restoring corrupted domains from backup) on the stand-alone system, one
    could in principle mount the cluster root filesets temporarily on the
    stand-alone system and copy the modified device database files to them.

    The problem with this idea is that I don't know enough about how the
    Tru64 hardware management works to be certain that the updated database
    files from the stand-alone system would be usable by the cluster, or
    for that matter, exactly which hardware database files would need to be
    copied...

     2) leave the new device names as they are and update the links in
    /etc/fdmns on the stand-alone system disk to point to the new devices;
    after doing that, I could verify the domains and restore from backup
    as needed, then temporarily mount the cluster root filesets on the
    stand-alone system to update the AdvFS links on them, too, so that the
    cluster could find all of its disks on the next boot.

    But this would only work if the cluster makes the same new device
    name assignments as the stand-alone system, and I'm not sure how good
    a bet that is...

     3) restore the device assignment of the RAID LUNs on the stand-alone
    system as in option 1, then verify and restore from backup only the
    user disks so that all the entries in /etc/fstab on the stand-alone
    system are working again; then run clu_create to recreate the
    single-member cluster from scratch. Finally, restore modified configuration
    files selectively from backup to bring the system back to the state
    it was in before the WWIDs changed.

    I think that is the most likely option to work, but that last step
    might not be as simple as it sounds...
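
    Option 2 amounts to retargeting the /etc/fdmns links from the old
    device names to the new ones. A minimal Python sketch of that
    planning step follows; it only computes the new link targets and
    changes nothing on disk. The domain names, the helper names and the
    in-order pairing dsk3 -> dsk14 ... dsk7 -> dsk18 are assumptions for
    illustration; the real pairing would have to be confirmed from the
    WWIDs.

```python
import re

# ASSUMPTION: old and new device names pair up in order (dsk3 -> dsk14, ...,
# dsk7 -> dsk18). The real pairing must be verified against the WWIDs.
DEVICE_MAP = {f"dsk{old}": f"dsk{old + 11}" for old in range(3, 8)}

# A link target such as /dev/disk/dsk3c: directory, disk name, partition letter.
_TARGET = re.compile(r"^(?P<dir>.*/)?(?P<disk>dsk\d+)(?P<part>[a-h])$")


def retarget(target, device_map):
    """Return the link target with its disk name replaced, or unchanged if unmapped."""
    match = _TARGET.match(target)
    if match is None:
        return target
    new_disk = device_map.get(match.group("disk"))
    if new_disk is None:
        return target
    return f"{match.group('dir') or ''}{new_disk}{match.group('part')}"


def plan(links, device_map):
    """Return (link path, old target, new target) for every link that must change."""
    changes = []
    for path, target in sorted(links.items()):
        new_target = retarget(target, device_map)
        if new_target != target:
            changes.append((path, target, new_target))
    return changes


# Hypothetical /etc/fdmns contents (link path -> current target).
LINKS = {
    "/etc/fdmns/cluster_root/dsk2b": "/dev/disk/dsk2b",
    "/etc/fdmns/root1_domain/dsk3a": "/dev/disk/dsk3a",
    "/etc/fdmns/users_domain/dsk7c": "/dev/disk/dsk7c",
}

for path, old, new in plan(LINKS, DEVICE_MAP):
    print(f"{path}: {old} -> {new}")
```

    The sketch matches whole disk names, so a link to dsk30 is not
    mistaken for dsk3, and it leaves the link names themselves alone.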

    Any suggestions or pointers to relevant documentation would be
    greatly appreciated!

    Regards,
    Bill Bennett

    ----------------------------------------------------------------------------
    Dr. William Bennett                  within Germany    International
    MPG AG Ribosomenstruktur     Tel:    (040) 8998-2833   +49 40 8998-2833
    c/o DESY                     FAX:    (040) 897168-10   +49 40 897168-10
    Notkestr. 85
    D-22603 Hamburg         Internet:    bennett@mpgars.desy.de
    Germany

