Re: Creating a wide area VMS Cluster

From: Keith A. Lewis (lewis_at_mazda.mitre.org)
Date: 09/18/03


Date: Thu, 18 Sep 2003 15:12:54 +0000 (UTC)

Lyndon Bartels <lbartels@pressenter.com> writes in article <3F68D16F.532774C2@pressenter.com> dated Wed, 17 Sep 2003 21:26:07 -0500:
>I'm thinking of building a wide area cluster.
>
>My goal is to provide a disaster tolerant cluster for both OS and data.
>
>My current plan has three sites, A, B, & C.
>
>At site A I have some hosts ("Fred" and "Barney") and data.
>At site B I have more hosts ("Wilma" and "Betty") and data.
>At site C I have one host ("Dino"), NO data.
>
>Imagine an equilateral triangle with a site at each point. The sides
>represent the network traffic. Any link can go down and network traffic
>will go around the triangle in the other direction. Each side of the
>triangle is about 12 miles.

Good design. I'm assuming you know which communications channels will allow
this; I'm no expert there.

>I could give all five hosts 1 vote each. That would yield 5 expected votes
>and a quorum of 3. Any one site could fail and cluster quorum would be
>maintained by the other two.

That's the one.

>or:
>
>I could give one host at each site one vote, yielding 3 expected votes,
>and a quorum of 2. The advantage there would be if I added a node to site A
>("Bambam") quorum would not have to be recalculated, and I'd still have
>site equality.

No good. If you lost one site (say C), the two remaining votes would sit
exactly at quorum, so you could not even reboot a voting node without losing
quorum. If that node boots from a directly-connected disk it's a relatively
short downtime, but you can avoid it altogether with the first design.

If you add a third voting node to site A, you can balance it by giving one
of the nodes at site B 2 votes.
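
To make the vote arithmetic above concrete, here is a rough sketch in Python
(nothing VMS-specific, just the sums). It assumes the usual rule that quorum
is (EXPECTED_VOTES + 2) / 2 rounded down. The function names and the three
dictionaries are made up for illustration; the vote counts come from the
layouts discussed above.

```python
def quorum(expected_votes):
    # Assumed rule: (EXPECTED_VOTES + 2) / 2, truncated to an integer.
    return (expected_votes + 2) // 2


def survives(votes_by_site, lost_site, rebooting_votes=0):
    """Return True if the cluster keeps quorum after losing one site.

    rebooting_votes is the number of votes temporarily absent because a
    surviving voting node is being rebooted.
    """
    total = sum(votes_by_site.values())
    remaining = total - votes_by_site[lost_site] - rebooting_votes
    return remaining >= quorum(total)


# Hypothetical labels for the layouts discussed in this thread.
five_votes = {"A": 2, "B": 2, "C": 1}    # every host has 1 vote
three_votes = {"A": 1, "B": 1, "C": 1}   # one voting host per site
rebalanced = {"A": 3, "B": 3, "C": 1}    # Bambam added, one B node gets 2

if __name__ == "__main__":
    layouts = [("five", five_votes), ("three", three_votes),
               ("rebalanced", rebalanced)]
    for name, layout in layouts:
        for site in layout:
            print(name, "lose", site,
                  "quorum kept:", survives(layout, site),
                  "and with a 1-vote node rebooting:",
                  survives(layout, site, rebooting_votes=1))
```

By this arithmetic the three-vote layout never tolerates a reboot on top of a
lost site, while the five-vote layout tolerates it when the lost site is C;
after losing A or B it too is sitting exactly at quorum.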

>I'm thinking the number of votes at site A must be equal to the votes at
>site B.
>
>The third site is used only as a quorum site. Sites A and B will be where
>the work gets done.

Good plan.

>Now.. The data:
>
>I have DWDM delivered fibre fabric between sites A and B.
>
>All disks are attached via fibre channel, except at site C, which will only
>have a system disk and be attached via SCSI.
>
>I'm thinking that each site will have a copy of the system disk, identical
>except for volume labels.
>
>Assume the following disks:
>
>Site A:
>
>$1$DGA1500: SYSTEMA (system disk for site A)
>$1$DGA1501: DATA1 (some data disk)
>
>Site B:
>
>$1$DGA1000: SYSTEMB (system disk for site B)
>$1$DGA1001: DATA1 (some data disk)
>
>
>Site C:
>Dino$DKA0: SYSTEMC (system disk for site C)
>
>
>I'm thinking this will be the mount procedure:
>
>$ IF F$GETDVI("$1$DGA1000:","EXISTS") THEN -
>$ MOUNT/SYSTEM/NOASSIST $1$DGA1000: SYSTEMB SYSTEMB
>$ IF F$GETDVI("$1$DGA1001:","EXISTS") THEN -
>$ MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=("$1$DGA1001:") DATA1 DATA1
>$!
>$ IF F$GETDVI("$1$DGA1500:","EXISTS") THEN -
>$ MOUNT/SYSTEM/NOASSIST $1$DGA1500: SYSTEMA SYSTEMA
>$ IF F$GETDVI("$1$DGA1501:","EXISTS") THEN -
>$ MOUNT/SYSTEM/NOASSIST DSA100:/SHADOW=("$1$DGA1501:") DATA1 DATA1
>$!

The ideal command for a cold mount of DSA100 would use
/SHADOW=($1$DGA1001:,$1$DGA1501:), which would avoid any rebuilding in the
case where it was dismounted cleanly last time around. If DSA100 is already
mounted on another node, you don't need /SHADOW= at all.

If you change /SYSTEM to /CLUSTER it makes things easier the first time.
Once the cluster is up all the disks will continue to "exist" even if they
are unavailable.

>I'm thinking that if I move the sysuaf file, etc. off the system disk onto
>a shadowed disk, I'll be able to have the multiple copies of the system
>disk. This buys me the possibility of taking one site off-line for
>upgrades if necessary.

So the disk containing SYSUAF et al. would be shadowed across multiple
sites.

>I'm trying to keep data replicated between sites A and B. But at the same
>time, keep all the read I/Os local. I want as little data as possible to
>travel between the two sites.

So you *don't* want to shadow the data disk across multiple sites? That sort
of contradicts the goal above, unless you're going to have one shadow set
dedicated to this system data.

I'm not sure if this is officially supported, but you can DEFINE/SYSTEM/EXEC
a logical name SYSUAF that points to multiple files. If the volume
containing the first one is not mounted, VMS will use the next in the list.
However, if the site containing the first one goes down unexpectedly, the
disk will go into mount verification rather than dismount, and processes
attempting to access SYSUAF will hang (including logins).

You might want to consider using a different SYSUAF (et al.) at each site,
maybe with a batch job copying some of them around after certain events
(password change, addition of a user, etc.).

--Keith Lewis klewis$mitre.org
The above may not (yet) represent the opinions of my employer.


