Booting a poor-man's LAVC [was:Interesting cluster config "deadlock"]
- From: cornelius@xxxxxxxxxxxxxxxxx (George Cornelius)
- Date: 4 May 2007 14:10:52 -0500
In article <1171983169.765046.30190@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>, etmsreec@xxxxxxxxxxx writes:
I know some full production environments that have been like this for many months (years?).
I managed an environment where a VAX with locally attached DSSI disks
was clustered with a pair of turbolasers with shared SCSI. The VAX
needed stuff from the Alphas to boot and the Alphas needed stuff from
the VAX.
We also needed to retain cluster quorum.
Ultimate answer was to bring up the one Alpha with very little
starting. Then bring up the VAX and the other Alpha, then reboot the first Alpha.
Messy, but it worked.
JF Mezei wrote:
The local transformer blew its fuse on a very cold winter day. I was
literally powerless to keep my systems running.
Upon rebooting, I found myself in an interesting situation. Being in the
(slow) process of moving stuff and restructuring my cluster, I found out my
cluster had been left in a precarious state!
SYSUAF (et al.) was still on a node1 disk. User disk is on node2, but node2
boots off node3.
Old topic, I know, but I think it is important to mention that network-based
clusters tend to have these issues, especially if you don't have multipathed
disks (CI, DSSI, Fibre Channel, etc.).
CI- and DSSI-based clusters really are the gold standard, and we have to work
around the missing features when we go to LAVCs.
With regard to non-shared disks, I have tried various things in vanilla,
low-to-moderate-budget LAVCs over the years.
My first idea, which I used for some time, was a three-member shadow
set for the critical data, with one member on each of the three uVAX 3100s
that made up the cluster.
I wrote an elaborate startup sequence that waited for a second member to
join when rebooting from scratch, then tried to mount the shadow set with
/NOCOPY and a minimum of two members. Unfortunately, I did not realize
that $ MOUNT/NOASSIST would mount _whatever_ members it saw instead of
the ones you wanted it to mount. I thought my two-out-of-three code kept
me safe from the old "fall backward in time" issue of mounting a stale
shadow member, but I was in fact completely exposed [fixed now -
see $ MOUNT/POLICY].
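The check I thought I had can be sketched roughly as follows. This is Python rather than DCL, purely to show the logic; the function name and the device names are made up for illustration and are not what my startup procedure actually used. The idea is to name the members you are willing to mount explicitly and refuse to go on unless at least two of the three are visible. Any two members of a three-member set overlap with any other two, so if the set was always written with at least two members present, a majority must contain at least one current copy.

```python
def members_to_mount(expected, visible, minimum=2):
    """Return the expected shadow set members that are currently visible.

    Raises RuntimeError when fewer than `minimum` expected members can be
    seen, so the caller never mounts a lone (possibly stale) member.
    """
    visible_set = set(visible)
    # Keep only devices that belong to the set; ignore anything else in view.
    usable = [member for member in expected if member in visible_set]
    if len(usable) < minimum:
        raise RuntimeError(
            "only %d of %d members visible, need %d"
            % (len(usable), len(expected), minimum)
        )
    return usable


# Hypothetical device names, one disk per node.
EXPECTED = ["NODE1$DKA100", "NODE2$DKA100", "NODE3$DKA100"]
print(members_to_mount(EXPECTED, ["NODE3$DKA100", "NODE1$DKA100"]))
```

The point of returning the list is that the mount should then be told exactly which members to use, rather than being left to pick up whatever it happens to find.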
When I learned this, I changed /NOASSIST to /ASSIST and played games:
issuing the mount in a spawned subprocess and killing it from the parent
process if it did not eventually complete, on the assumption that the
hang was due to it attempting to ask the (unavailable) operator about
some inaccessible disk in the mount operation. I still do this for my shared
SYSUAF disk in my Fibre Channel clusters, just because it's been working
for years and I don't want to destabilize anything. Note that the
startup also offers an option to drop into DCL for manual mount
commands if all else fails.
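For anyone who wants the shape of the spawn-and-kill trick without reading DCL, here is a minimal sketch in Python. It is not my startup code; the function name is hypothetical and the command being run is just a stand-in that sleeps, where the real thing would be the mount. The parent runs the command in a child process and gives up on it if it has not finished by the deadline.

```python
import subprocess
import sys


def run_with_deadline(command, timeout_seconds):
    """Run `command` in a child process and wait at most `timeout_seconds`.

    Returns True only if the child finished in time with exit status 0.
    """
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child itself before raising this.
        return False
    return result.returncode == 0


# Stand-in for a mount that hangs waiting on an operator who is not there.
hung = [sys.executable, "-c", "import time; time.sleep(5)"]
if not run_with_deadline(hung, timeout_seconds=0.5):
    print("mount did not complete, falling back to manual commands")
```

A False result is where the startup would either retry or drop to a prompt for manual mount commands.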
I still maintain a couple of poor-man's LAVCs, and have finally decided
that if there are no funds to do anything truly highly available/highly
reliable, I should just put all the data on the boot node and let the
application nodes wait for it if it is down. For one that I
inherited, which does multi-node mirroring with non-multipathed disks,
I worked around some of the perennial cluster startup issues by just
throwing in a long delay in startup to make sure all the nodes have plenty
of time to join before going on to try to mount the shadow sets. At some
point I may make it a bit smarter, but as far as I am concerned, if there's
no budget for high availability then you don't have to pretend you are
offering it.
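If I ever do make it smarter, it would probably look something like the sketch below: poll for the expected nodes and go on as soon as they have all joined, with the long delay kept only as an upper bound. Again this is Python for illustration, not DCL. The function name, the node names, and the probe callable (which stands for however you ask the cluster who is currently a member) are all assumptions of mine.

```python
import time


def wait_for_members(expected, probe, deadline_seconds, poll_seconds=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every expected node has joined or the deadline passes.

    `probe` is a callable returning the node names currently in the cluster.
    Returns the set of nodes still missing (empty if everyone showed up).
    """
    expected = set(expected)
    stop_at = clock() + deadline_seconds
    while True:
        missing = expected - set(probe())
        if not missing or clock() >= stop_at:
            return missing
        sleep(poll_seconds)


# Hypothetical probe that always sees only two of the three nodes.
still_missing = wait_for_members(
    ["NODE1", "NODE2", "NODE3"],
    probe=lambda: ["NODE1", "NODE3"],
    deadline_seconds=0,
)
print(sorted(still_missing))
```

Whatever comes back as missing tells the startup which shadow set members not to bother waiting for, which is more than the fixed delay can do.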
Finally, of course, you are allowed to have per-node UAF/RIGHTSLIST/etc.
files if you have a mechanism for keeping them in sync. Basically, whatever
you change anywhere you have to change everywhere. And even if you
don't replicate that way, it is probably a good idea to keep a reasonably
current copy of the SYSUAF stashed away somewhere on each boot disk so
you can still log in as SYSTEM if you have to boot without the primary
copy.
--
George Cornelius cornelius(at)eisner.decus.org
cornelius(at)mayo.edu