Problem with CLUSTER_CONFIG? (was: Re: DIFVOLMNT (%X0072832C), then bugcheck when booting any node.)

From: Galen Tackett (gtackett_at_yahoo.com)
Date: 10/07/03


Date: Mon, 06 Oct 2003 21:13:34 -0400

I now understand how this problem occurred. Pending further information,
I'd say it appears to be a weakness in the CLUSTER_CONFIG process.

I reran CLUSTER_CONFIG up to the point where you'd normally boot the
new system, which in this case is to be a voting boot server (see
my original post below for a few more config details). While
CLUSTER_CONFIG waited for the new node to boot, I examined the new
node's parameters in its system-specific ALPHAVMSSYS.PAR.

It turns out that the CLUSTER_CONFIG process had set EXPECTED_VOTES=1.

Now, recall that the new system at first had a bad Ethernet connection
(and it has no other alternative path for cluster comms). I examined my
boot logs from that original attempt to add the new node.

Sure enough, when the new node booted (to run its initial AUTOGEN) it
saw no other cluster nodes, and it formed a VAXcluster all by itself. I
assume that at this point it corrupted the boot volume's SCB (Storage
Control Block, the first block in BITMAP.SYS), which led to the
DIFVOLMNT when we next tried to reboot an existing cluster node.

SO: Is it reasonable to have CLUSTER_CONFIG set up a new voting node to
initially use EXPECTED_VOTES of 1? Doing so means that there's a risk of
corrupting the system disk if the new node needs a LAN to boot but
doesn't have a good connection.

And since CLUSTER_CONFIG is executing on a cluster with at least one
vote already present, couldn't CLUSTER_CONFIG.COM set EXPECTED_VOTES to
at least 2? Or perhaps it could look at the other nodes in the cluster
to get their EXPECTED_VOTES?
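
To make the arithmetic behind this suggestion concrete, here is a minimal
sketch in Python (not DCL, and not what CLUSTER_CONFIG.COM actually does).
It assumes the documented OpenVMS quorum formula, (EXPECTED_VOTES + 2) / 2
with the fraction dropped, and ignores quorum disks and the dynamic
adjustments the connection manager makes once members are visible. The
function names, including suggested_expected_votes, are hypothetical and
only illustrate the idea.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def can_form_cluster_alone(own_votes: int, expected_votes: int) -> bool:
    """True if a node that sees no other members already holds quorum by itself."""
    return own_votes >= quorum(expected_votes)


def suggested_expected_votes(current_cluster_votes: int, new_node_votes: int) -> int:
    """Hypothetical policy: the vote total once the new node has joined."""
    return current_cluster_votes + new_node_votes


if __name__ == "__main__":
    # A new voting node (VOTES=1) booting with no working LAN connection.
    for ev in (1, 2, 3):
        print(ev, quorum(ev), can_form_cluster_alone(1, ev))
```

With EXPECTED_VOTES=1 the quorum is 1, so an isolated node holding one vote
can form a cluster on its own. With EXPECTED_VOTES=2 the quorum is 2, so the
same isolated node would presumably wait for quorum instead of mounting the
shared system disk by itself.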

Perhaps there's some rationale for having EXPECTED_VOTES of 1, but is it
worth the possibility of corrupting the system disk in a scenario like
this? If so, perhaps Hoff or one of the other experts can explain.

(Sure, you could say that we should have tested the LAN connection
beforehand, and you're probably right, but wouldn't a safety measure be
desirable here?)

Fortunately the damage was confined to several log files on the system
disk, and we didn't need to restore from backup.

In article <bdc65a53.0310030347.33bd7031@posting.google.com>,
 gspamtackett@yahoo.com (Galen) wrote:

> We've gotten into this situation with our cluster twice recently. (I'm
> not referring to a VOLALRMNT error, which is a different numerical
> status.)
>
> Configuration is:
>
> A single OpenVMS Alpha V7.3-1 system disk which is very current on
> patches.
> System disk lives on an HSG80, reached via a SAN core switch.
> Satellites do not have any shared-storage connections (i.e. no DSSI,
> no FibreChannel, no shared SCSI).
> 13 boot servers and 7 satellites, all Alphas.
> Running Storageworks RAID software (not sure how relevant).
>
> In both cases, we had recently run CLUSTER_CONFIG to add a new server
> node. However, in each case, the new node had no physical LAN
> connection (fiber not hooked up) and took a CLUEXIT bugcheck after a
> few minutes.
>
> Each time, shortly after the CLUEXIT, we got the node's LAN connection
> working and re-ran CLUSTER_CONFIG. Just after the new node reached the
> point where it reports there's no pagefile on the system disk
> (%SYSINIT-I-PAGEFILE), it reported:
>
> %SYSINIT-E-Error mounting system device, status = 0072832C
>
> We checked these things:
>
> * No other clusters with same cluster ID (we only have one other
> cluster)
> * All systems have VAXCLUSTER set to 2.
> * The volume label on the system disk has not been changed since the
> cluster was last booted.
>
> The only solution we've found is to reboot the cluster (not a pleasant
> option, of course).
>
> But we're just as concerned to find out what's causing this. I suspect
> that the CLUEXIT during CLUSTER_CONFIG somehow is involved but have
> only a little circumstantial evidence, as described here.
>
> HP software support and the maintainer of the MOUNT code have given us
> a little script to periodically check the volume's SCB and report if
> its checksum changes. Beyond that, they're out of ideas right now.
>
> (FYI, the bad connections occur because our fiber cable plant is very
> badly documented, has a lot of old labels, and some of the fibers have
> been damaged at one time or another. But this is another issue.)
>
> Thanks for any help or suggestions,
>
> Galen

-- 
Galen Tackett
*** To e-mail me just remove the spam from my address

