Problem with CLUSTER_CONFIG? (was: Re: DIFVOLMNT (%X0072832C), then bugcheck when booting any node.)
From: Galen Tackett (gtackett_at_yahoo.com)
Date: 10/07/03
Date: Mon, 06 Oct 2003 21:13:34 -0400
I now understand how this problem occurred. Pending further information,
I'd say it appears to be a weakness in the CLUSTER_CONFIG process.
I reran CLUSTER_CONFIG up to the point where you'd normally boot the
new system, which in this case is to be a voting boot server (see
my original post below for a few more config details). While
CLUSTER_CONFIG waited for the new node to boot, I examined the new
node's parameters in its system-specific ALPHAVMSSYS.PAR.
It turns out that the CLUSTER_CONFIG process had set EXPECTED_VOTES=1.
Now, recall that the new system at first had a bad Ethernet connection
(and it has no other alternative path for cluster comms). I examined my
boot logs from that original attempt to add the new node.
Sure enough, when the new node booted (to run its initial AUTOGEN) it
saw no other cluster nodes, and it formed a VAXcluster all by itself. I
assume that at this point it corrupted the boot volume's SCB (Storage
Control Block, the first block in BITMAP.SYS), which led to the
DIFVOLMNT when we next tried to reboot an existing cluster node.
SO: Is it reasonable to have CLUSTER_CONFIG set up a new voting node to
initially use EXPECTED_VOTES of 1? Doing so means that there's a risk of
corrupting the system disk if the new node needs a LAN to boot but
doesn't have a good connection.
And since CLUSTER_CONFIG is executing on a cluster with at least one
vote already present, couldn't CLUSTER_CONFIG.COM set EXPECTED_VOTES to
at least 2? Or perhaps it could look at the other nodes in the cluster
to get their EXPECTED_VOTES?
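To make the arithmetic concrete, here is a minimal sketch, assuming the
documented OpenVMS quorum formula, quorum = (EXPECTED_VOTES + 2) / 2 with
the fraction truncated, and assuming the new node boots with VOTES=1 (the
post only says it is a voting node). The function names are hypothetical
and only illustrate the calculation; this is not how the connection
manager is implemented.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def can_form_cluster_alone(node_votes: int, expected_votes: int) -> bool:
    """True if a node that sees no other members already holds enough votes for quorum."""
    return node_votes >= quorum(expected_votes)


if __name__ == "__main__":
    # Assumption: the new voting node contributes VOTES=1.
    for ev in (1, 2, 3):
        print(f"EXPECTED_VOTES={ev}: quorum={quorum(ev)}, "
              f"lone node with 1 vote forms a cluster: {can_form_cluster_alone(1, ev)}")
```

With EXPECTED_VOTES=1 the quorum is 1, so an isolated node with one vote
is satisfied on its own. With EXPECTED_VOTES=2 the quorum is 2, so the
same isolated node would wait for quorum rather than proceed by itself.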
Perhaps there's some rationale for having EXPECTED_VOTES of 1, but is it
worth the possibility of corrupting the system disk in a scenario like
this? If so, perhaps Hoff or one of the other experts can explain.
(Sure, you could say that we should have tested the LAN connection
beforehand, and you're probably right, but wouldn't a safety measure be
desirable here?)
Fortunately the damage was confined to several log files on the system
disk, and we didn't need to restore from backup.
In article <bdc65a53.0310030347.33bd7031@posting.google.com>,
gspamtackett@yahoo.com (Galen) wrote:
> We've gotten into this situation with our cluster twice recently. (I'm
> not referring to a VOLALRMNT error, which is a different numerical
> status.)
>
> Configuration is:
>
> A single OpenVMS Alpha V7.3-1 system disk which is very current on
> patches.
> System disk lives on an HSG80, reached via a SAN core switch.
> Satellites do not have any shared-storage connections (i.e. no DSSI,
> no FibreChannel, no shared SCSI).
> 13 boot servers and 7 satellites, all Alphas.
> Running Storageworks RAID software (not sure how relevant).
>
> In both cases, we had recently run CLUSTER_CONFIG to add a new server
> node. However, in each case, the new node had no physical LAN
> connection (fiber not hooked up) and took a CLUEXIT bugcheck after a
> few minutes.
>
> Each time, shortly after the CLUEXIT, we got the node's LAN connection
> working and re-ran CLUSTER_CONFIG. Just after the new node reached the
> point where it reports there's no pagefile on the system disk
> (%SYSINIT-I-PAGEFILE), it reported:
>
> %SYSINIT-E-Error mounting system device, status = 0072832C
>
> We checked these things:
>
> * No other clusters with same cluster ID (we only have one other
> cluster)
> * All systems have VAXCLUSTER set to 2.
> * The volume label on the system disk has not been changed since the
> cluster was last booted.
>
> The only solution we've found is to reboot the cluster (not a pleasant
> option, of course).
>
> But we're just as concerned to find out what's causing this. I suspect
> that the CLUEXIT during CLUSTER_CONFIG somehow is involved but have
> only a little circumstantial evidence, as described here.
>
> HP software support and the maintainer of the MOUNT code have given us
> a little script to periodically check the volume's SCB and report if
> its checksum changes. Beyond that, they're out of ideas right now.
>
> (FYI, the bad connections occur because our fiber cable plant is very
> badly documented, has a lot of old labels, and some of the fibers have
> been damaged at one time or another. But this is another issue.)
>
> Thanks for any help or suggestions,
>
> Galen
-- Galen Tackett *** To e-mail me just remove the spam from my address