Re: Creating a wide area VMS Cluster
From: Keith Parris (keithparris_NOSPAM_at_yahoo.com)
Date: 18 Sep 2003 12:06:14 -0700
Lyndon Bartels <email@example.com> wrote in message news:<3F68D16F.532774C2@pressenter.com>...
> I'm thinking of building a wide area cluster.
> My goal is to provide a disaster tolerant cluster for both OS and data.
An excellent choice.
> Imagine an equilateral triangle with a site at each point. The sides
> represent the network traffic. Any link can go down and network traffic
> will go around the triangle in the other direction. Each side of the
> triangle is about 12 miles.
Allowing the links to the quorum site to be a backup for the main
inter-site link can be an economical solution. But be aware that if
you have a bridge at each site, and all the bridges are linked
together, then the Spanning Tree algorithm will turn off one of the 3
legs of your triangle, as otherwise there would be a loop in the LAN,
and things like multicast packets would circulate "forever".
As most of your traffic will be between sites A and B, you might want
to set up your bridges' Spanning Tree root priorities so that, say,
the bridge at Site A is normally selected as the root bridge (and the
link between sites B and C would normally be the disabled one), with
the bridge at Site B having the next-best priority value, so that it
would become the root bridge if Site A were down.
Another way to avoid having any of the (typically expensive)
inter-site links turned off would be to have two separate spanning
trees, by separating the bridges (this requires multiple LAN adapters
in the hosts at sites A and B, but that's a good thing to have
anyway for redundancy):
Site A  Bridge ----------------- Bridge  Site B
  \               Site C               /
   Bridge -------- Bridge ------- Bridge

For more information on the Spanning Tree algorithm, I highly
recommend the excellent and entertaining book "Interconnections" (2nd
ed.) by Radia Perlman, ISBN 0-201-63448-1.
> First let's talk host config. and skip data.
> I'm thinking of two possible configs:
> I could give all five hosts 1 vote each. That
> would yield 5 expected votes and a quorum of 3. Any one site could fail
> and cluster quorum would be maintained by the other two.
> I could give one host at each site one vote, yielding 3 expected votes,
> and a quorum of 2. The advantage there would be if I added a node to
> site A ("Bambam") quorum would not have to be recalculated, and I'd still
> have site equality.
As with the other posters, I'd go with one vote per system, as it
gives you more flexibility to take nodes up and down without
disrupting the balance of the effect of votes between sites A and B.
As another poster pointed out, if you needed an unequal number of
nodes in sites A and B, either give one node 2 votes or one node zero
votes to equalize the votes between sites A and B.
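As a rough sketch, the one-vote-per-node scheme is just a couple of
MODPARAMS.DAT entries on every node, applied with AUTOGEN:

    VOTES = 1            ! this node's vote
    EXPECTED_VOTES = 5   ! total votes configured in the cluster

    $ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK

The new values take effect at the next boot; a running cluster will
also raise its own notion of expected votes as voting members join.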
> I'm thinking the number of votes at site A must be equal to the votes
> at site B.
I would certainly recommend that, as it allows the node at Site C to
cast the tie-breaking vote, avoids making an arbitrary decision ahead
of time as to which of Site A or B will continue in the event of a
loss of communications between the two sites (and instead allows
whichever site can still talk to Site C to continue), and reduces your
exposure to the "Creeping Doom" scenario.
And remember that you'll normally NOT want to use the REMOVE_NODE
option to SHUTDOWN.COM when you take down nodes at sites A or B, as
that effectively unbalances the votes between sites.
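If you want to double-check the arithmetic from DCL before taking a
node down, a quick sketch:

    $ WRITE SYS$OUTPUT "This node's votes: ", F$GETSYI("VOTES")
    $ WRITE SYS$OUTPUT "Cluster votes:     ", F$GETSYI("CLUSTER_VOTES")
    $ WRITE SYS$OUTPUT "Cluster quorum:    ", F$GETSYI("CLUSTER_QUORUM")

SHOW CLUSTER can display the same sort of information for every node.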
> I have DWDM delivered fibre fabric between sites A and B.
This is great, as it allows multiple channels of multiple types (e.g.
Gigabit Ethernet and Fibre Channel), so there is lots of bandwidth
between sites for things like shadowing full-copies and full-merges.
And it is very cost-effective, sending all that data down a single
fiber pair.
> All disks are attached via fibre channel. Except site C which will only
> have a system disk and be attached via SCSI.
The nice things about having a Fibre Channel inter-site link include:
o Shadowset members at each site can be accessed remotely and kept
up-to-date even if all systems at that site are down
o The small amount of host CPU overhead and the small additional latency
of going through the VMS MSCP server are avoided
o You have the option of a single shadowed system disk between the
two sites, if you wish. I don't recommend it, as it then represents a
single point of failure for (two sites of) the cluster, but some folks
do it anyway for the convenience, just being sure they have a backup
readily available (often a recent backup copy kept online) in case
someone scrozzles the sole system disk -- they would simply reboot
nodes from the backup disk to get going again.
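If you did go the single cross-site shadowed system disk route,
booting from it is controlled by a couple of SYSGEN parameters --
roughly (the unit number here is just a placeholder for whichever
DSAnnnn: virtual unit you pick):

    SHADOW_SYS_DISK = 1    ! the system disk is a shadow set
    SHADOW_SYS_UNIT = 17   ! its virtual unit number, i.e. DSA17: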
> I'm thinking that each site will have a copy of the system disk.
> Identical except for volume labels.
This is common in practice.
> I'm thinking this will be the mount procedure:
> $ IF F$GETDVI("$1$DGA1001:","EXISTS") THEN -
> $ MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=("$1$DGA1001:") DATA1 DATA1
> $ IF F$GETDVI("$1$DGA1501:","EXISTS") THEN -
> $ MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=("$1$DGA1501:") DATA1 DATA1
The problem I see right away with this is that if a node at Site A
boots first, it will mount the data disk with only a single member,
$1$DGA1001:, and you'll have to do a Full Copy to add the other
member. You might want to have it check if both members are
available, and mount the shadowset with both members if so. (It
actually gets a bit more complicated than this in practice -- see my
1-day VMS DT cluster seminar notes at http://www2.openvms.org/kparris/ .)
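A rough sketch of that check, using your device names (deliberately
simplified -- the seminar notes cover the extra wrinkles, like
deciding which member is current after a site has been down):

    $ MEMBER_A = F$GETDVI("$1$DGA1001:","EXISTS")
    $ MEMBER_B = F$GETDVI("$1$DGA1501:","EXISTS")
    $ IF MEMBER_A .AND. MEMBER_B
    $ THEN
    $    MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=($1$DGA1001:,$1$DGA1501:) DATA1 DATA1
    $ ELSE
    $    IF MEMBER_A THEN MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=($1$DGA1001:) DATA1 DATA1
    $    IF MEMBER_B THEN MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=($1$DGA1501:) DATA1 DATA1
    $ ENDIF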
> I'm thinking that if I move the sysuaf file, etc. off the system disk
> onto a shadowed disk, I'll be able to have the multiple copies of the
> system disk. This buys me the possibility of taking one site off-line for
> upgrades if necessary.
Right. Moving the cluster-common files off system disks also makes
system management of a multiple-system-disk cluster much easier, as
you have only one copy of each shared file to maintain.
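For example (the disk and directory are placeholders -- use whatever
shadowed data disk you set aside for this), the usual approach is to
redefine the well-known logical names in SYLOGICALS.COM on every node:

    $ DEFINE/SYSTEM/EXEC SYSUAF          DSA100:[VMS$COMMON]SYSUAF.DAT
    $ DEFINE/SYSTEM/EXEC RIGHTSLIST      DSA100:[VMS$COMMON]RIGHTSLIST.DAT
    $ DEFINE/SYSTEM/EXEC NETPROXY        DSA100:[VMS$COMMON]NETPROXY.DAT
    $ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE DSA100:[VMS$COMMON]VMSMAIL_PROFILE.DATA
    $ DEFINE/SYSTEM/EXEC QMAN$MASTER     DSA100:[VMS$COMMON]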
> I'm trying to keep data replicated between sites A and B. But at the
> same time, keep all the read I/Os local. I want as little data as
> possible to travel between the two sites.
Look into the new $SET DEVICE/SITE command. Because Shadowing can't
tell the difference between a Fibre Channel disk at the local site and
one at the remote site, by default it would send reads to both on a
round-robin basis, so you want to tell Shadowing which site each node
and each disk are at, so that it can send all the reads to the local
disks from a given site.
$SET DEVICE/READ_COST could also be used, but is more trouble to set
up than using /SITE.
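For example (the site ID values 1 and 2 and the device names are
arbitrary -- pick your own), a node at Site A would do something like:

    $ SET DEVICE/SITE=1 DSA100:       ! this node treats site 1 as local
    $ SET DEVICE/SITE=1 $1$DGA1001:   ! member physically at Site A
    $ SET DEVICE/SITE=2 $1$DGA1501:   ! member physically at Site B

A node at Site B would set the DSA100: virtual unit to site 2 instead,
so its reads go to $1$DGA1501:.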
> I'm thinking in this setup, I could lose disks, hosts, or combinations
> of them, and still be able to run. I'm thinking for what little data the
> third site would need, MSCP served disks would suffice. And I could put a
> hook in the startup to only mount the disks with the UAF data (and maybe user
> accounts) via this method.
> I'm thinking that by mounting the other sites' system disk, I could then
> copy config files (if they were to change) between sites, etc.
I can think of a couple of complications with this:
1) In practice, when you lose a site, you'll end up with the system
disk at the lost site going into mount verification timeout state on
the remaining nodes, and you'll need to do a $DISMOUNT/ABORT to clear it.
2) Also, the more nodes that have a disk mounted, the greater the chance
of a Full Merge happening, as that is triggered whenever any one of
the nodes with the shadowset mounted crashes.
Another alternative would be to mount the disks only when you need to,
or use DECnet FAL access to grab files through the associated node.
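For illustration (DSA200: as the other site's system disk virtual
unit and SITEB as a node name are hypothetical), the two approaches
look roughly like:

    $ DISMOUNT/ABORT DSA200:   ! clear a shadowset stuck in mount verification
    $ COPY SITEB::SYS$COMMON:[SYSMGR]SYLOGICALS.COM SYS$MANAGER:*.*

The DECnet copy of course assumes appropriate proxy or account access
on the remote node.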
> If I were to set this up..
> What would potential test scenarios be?
I'd want to test the failure and recovery of any component, and any
site, and be sure your system managers/operators know how to diagnose
and handle each case.
Handle cases where, for example, Fibre Channel communications between
Sites A & B is lost but SCS communications continues. (I recommend
setting up MSCP-serving as a backup to Fibre Channel, and running
7.3-1 or better, so you get direct-to-MSCP-served failover.)
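Enabling the MSCP server as that backup path is just a couple of
MODPARAMS.DAT entries (plus an AUTOGEN pass) on the nodes with direct
Fibre Channel access to the disks -- roughly:

    MSCP_LOAD = 1        ! load the MSCP server at boot
    MSCP_SERVE_ALL = 1   ! serve all available disks to the cluster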
> What would cause cluster communication failure?
o Errant backhoes, if you don't have diverse network paths. :-)
o Hardware failures, if you don't have redundancy configured.
o Hardware failures, even if you DO have redundancy configured, if
you don't monitor, detect, and repair the first failure before a
second failure occurs. (I recommend setting up LAVC$FAILURE_ANALYSIS)
second failure occurs. (I recommend setting up LAVC$FAILURE_ANALYSIS;
see the setup sketch below.)
o Saturation of a node's Primary CPU in interrupt state can cause loss
of cluster communications with that node.
But I get the feeling I didn't really understand your question.
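On the LAVC$FAILURE_ANALYSIS point above: the template lives in
SYS$EXAMPLES:, and setting it up is roughly (edit the network
description in the .MAR file to match your LAN topology first):

    $ COPY SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR SYS$MANAGER:
    $ SET DEFAULT SYS$MANAGER:
    $ MACRO LAVC$FAILURE_ANALYSIS
    $ LINK LAVC$FAILURE_ANALYSIS
    $ RUN LAVC$FAILURE_ANALYSIS   ! run on each node, typically from startup

It needs a suitably privileged context to load the network
description, and subsequent failures are reported via OPCOM.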
> Would there be false failures, and how could I avoid/detect them?
Could you explain what you mean by false failures?
> What have I missed?
> What have I mis-understood or misconfigured?
> What more is needed?
I recommend that you purchase the Disaster Tolerant Cluster Services
(DTCS) package right up front. It doesn't add a significant amount of
cost compared with the investment you're already planning. It will
help you with planning, design, configuration, implementation,
testing, and staff training, and ensure that nothing important gets
overlooked.
> Is anybody else doing anything like this?
Yes indeed. Having a 3rd site with a quorum node is a luxury, so not
as many sites have that, but it is very nice to have, so others will
likely envy your setup.