Re: Production VMS cluster hanging with lots of LEFO



On Mar 13, 5:32 am, filip.debl...@xxxxxxxxxxxx wrote:
Greetings.

Yesterday we had a massive incident on our most important VMS machines.

Production is configured as a disaster tolerant cluster containing four
identical midsize Alphas. These are grouped two-by-two into two computer
rooms, separated by more than 25 km. The connection between them is
provided by a four-fold, extremely high capacity network, which is also
shared by a massive army of UN*X boxes.

A fifth quorum node (small thing, only has to be present) sits in a
third room.

The application that is running on the cluster is ACMS driven and is
quite stable: everything is installed in memory, it takes up on average
at most 10-15% CPU, and it has memory to burn, so outswapped processes
are extremely rare.
This application accesses a monster SYBASE database, which is running on
a UN*X box (did I mention the thing was disaster tolerant? :-( )

OS is VMS 8.3, we run DECNET over IP.

The previous night, some "load test" was done on the network. Not a lot
is known about that, but it is believed it included the links between
the two sites.
I was not aware of this being done, and it would probably have been none
of my concern.

Very soon alarms started to come in stating that users could not log in
anymore, neither over the dedicated TCP/IP interfaces (using some
application-to-application mechanism) nor via SET HOST, TELNET, etc.

Fortunately I always keep some sessions open on my station (not part of
the cluster), which were still working. The system was NOT down.

When looking at the first system, I immediately noticed a significant
number of LEFO processes, most of them related to individual (DCL)
users, having close to 0 CPU time and I/O. I also spotted one HIBO
(REMACP!).
I was able to STOP/ID all the LEFOs (did not touch REMACP), to no avail.
When trying to find the real identity of a user (MC AUTHORIZE), my
session froze.

In a second session (on another machine of the cluster), the session got
iced as well during a DCL command.

I got worried.

It seemed that it was no longer possible to run an image (a lot of DCL
commands do start up an image). Very soon I lost control of _all_
sessions, but before that I was able to notice:
- the cluster was fine (all 5 machines up, all participating with 1
  vote)
- there was at least one looping process (happens all the time, we
  simply kill them)
- (not 100% sure of this) most of the LEFO processes were attempts to
  log in, trying to run LOGINOUT.EXE (just another image ...)

So SNAFU

It was found out later, by some (external) database monitoring, that at
least one of the looping processes (its image was already running by the
time the problems started) did do some DB activity, so the VMS process
was not aware of any problems and happily kept looping.

A desperate attempt to log in to the console (console monitoring is
running on a separate node) yielded no success. It appeared that all
machines (including the quorum node) were inaccessible (but not dead!)

Somewhat later it was claimed that the network modifications (?) were
rolled back. The VMS cluster did not recover by itself.

Finally (we need zillions of authorizations for everything) the quorum
node was crashed.
And I was happily looking at >>>

The first boot failed due to bizarre (and unrelated) problems, but
booting as MIN did work. I was able to log in to the quorum node (via
console of course).

A miracle happened. All LEFOs disappeared and the beast went back to
business.
Most processes simply continued from the point where they were blocked,
no damage (except for part of the application which had timed out; a
simple restart solved this).

Unfortunately I did not check whether the situation had already
normalised because of the crash of the quorum node. I only observed
'back to business' after the minimal reboot.

Now, 24 hours later, things are as normal as always.

A lot of unknowns are still left.

Q: what caused the image activator to go into LEFO (actually, to remain
in LEFO)? At some point during image activation (last phase?) it starts
waiting for an event flag. What could be setting that event flag? I
suspect it never came ...

Q: crashing (and rebooting) the quorum node solved things immediately.
Could this be caused by a lock held by the quorum node? If so, is this a
lock that is related to cluster transitions?

Q: would we have had the same effect by crashing/rebooting any one of
the other nodes?

And finally :

Can some form of (minor ?) network outage trigger events like this?

Any takers ?

advTHANKSance

Filip,

In essence, I concur with Vaxman. There are few ways to diagnose this
problem without a system dump (or live access during such an event to
SDA, the utility invoked by the ANALYZE/SYSTEM command).

LEF (and LEFO) are completely normal wait states. A SHOW SYSTEM on a
normal day will show many processes in this state. The "O" means that
the process is outswapped. With today's large memories, outswapped
processes are seen less often, but they are also fairly normal. One will
also normally see many processes in the HIB state; these have executed
the $HIBER system service and are waiting for some event to wake them.
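To make the naming convention concrete, here is a minimal Python sketch
that splits a state string into its base wait state and an "outswapped"
flag, then tallies a list of states. It assumes only the rule described
above (a trailing "O" marks an outswapped process); the function names
and the sample list are hypothetical, not captured SHOW SYSTEM output.

```python
from collections import Counter
from typing import Iterable, Tuple


def decode_state(state: str) -> Tuple[str, bool]:
    """Split a state name into (base state, outswapped?).

    Assumption: a trailing "O" on a name longer than three letters
    marks the outswapped variant (LEFO -> LEF, HIBO -> HIB).
    """
    state = state.strip().upper()
    if len(state) > 3 and state.endswith("O"):
        return state[:-1], True
    return state, False


def tally_states(states: Iterable[str]) -> Counter:
    """Count processes per (base state, outswapped?) pair."""
    return Counter(decode_state(s) for s in states)


if __name__ == "__main__":
    # Hypothetical sample, loosely modelled on the incident description.
    sample = ["LEFO", "LEFO", "LEF", "HIBO", "HIB", "CUR"]
    for (base, swapped), count in sorted(tally_states(sample).items()):
        label = "outswapped" if swapped else "resident"
        print(f"{base:5s} {label:10s} {count}")
```

A sudden jump in the outswapped count on a machine that normally has
"memory to burn" is the kind of oddity such a tally would highlight,
although by itself it says nothing about the cause.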

That much is completely normal. The quorum machine crashing could (or
could not) be part of the answer, as could the cluster continuing on
its way. If the cluster quorum machine left a dump when it crashed,
analyzing that dump might indicate why processes were freezing.
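On the question of whether crashing any other node would have had the
same effect on the cluster as a whole, the quorum arithmetic can be
sketched as follows. This is a simplified, static illustration in
Python: it assumes EXPECTED_VOTES is 5 (the original post only says
that five machines each contribute 1 vote) and uses the usual OpenVMS
rule, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction truncated.
The function names are hypothetical, and the real connection manager
adjusts quorum dynamically.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES, fraction truncated."""
    return (expected_votes + 2) // 2


def keeps_quorum(votes_present: int, expected_votes: int) -> bool:
    """True if the remaining votes are enough to keep the cluster running."""
    return votes_present >= quorum(expected_votes)


if __name__ == "__main__":
    EXPECTED_VOTES = 5  # assumption: five nodes, one vote each
    print("quorum:", quorum(EXPECTED_VOTES))
    for lost in range(4):
        present = EXPECTED_VOTES - lost
        print(f"{lost} node(s) lost, {present} votes present:",
              "quorate" if keeps_quorum(present, EXPECTED_VOTES) else "hang")
```

Under these assumptions the cluster stays quorate after losing any one
node, or one whole site, so the crash of the quorum node alone should
not have cost the remaining four machines their quorum.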

If the cluster communications channel (e.g., the LAN between the
machines) is disrupted in some way, strange things can happen.
In essence, the connection is presumed to be a simple IEEE 802.3
Ethernet. If someone puts routing or traffic shaping devices into that
path, it can produce interesting results. This is particularly true if
some security device is removing packets from the stream for some
reason.

The crash dumps are vital. A knowledge of precisely what was happening
on the network would be useful, but it would have to be a complete
list, without any "incidental" changes omitted.

- Bob Gezelter, http://www.rlgsc.com