Re: SHADDETINCON, SHADOWING detects inconsistent state



On 1 jan, 12:56, hel...@xxxxxxxxxxxxxxxxxxxxxxxx (Phillip Helbig---
remove CLOTHES to reply) wrote:
My hobbyist cluster currently consists of:

   VAX 4000-105A
   VAXstation 4000-90A
   DEC 3000 - M600

Each system has a 2- or 3-member shadow set as its system disk.  There
are some non-shadowed disks (including CD-ROMs) and some 2-member shadow
sets distributed among the nodes (each member has a direct connection to
only one machine).  In particular, DISK$USER has members on each of the
VAXes.  I haven't changed much in 2 or 3 years.

Starting several weeks ago, and becoming more frequent in the last
couple of weeks, the VAX 4000-105A spontaneously reboots.  Even though
SHADOW_MBR_TMO is set to 10 minutes and MVTIMEOUT to one hour
(SHADOW_SYS_TMO is 2 minutes but that isn't relevant here), after such a
reboot everything looks OK on the VAX 4000-105A but on (usually just one
of) the other machines, the system-disk shadow set and the CD-ROM on the
VAX 4000-105A and the DISK$USER shadow set have gone into mount-verify
timeout.  This has always happened during the night, so I don't know how
long the spontaneous reboot takes.  I can just dismount and remount the
system-disk shadow set and the CD-ROM on the VAX 4000-105A from the
other nodes, but since DISK$USER has gone into mount-verify timeout, I
have to reboot the corresponding node.  (Note that SYSUAF etc are all on
DISK$USER.)  I can't dismount it since it contains open files.  I
haven't tried DISMOUNT/ABORT in such a situation.  Should I?  With
DISK$USER inaccessible, various applications will fail.  A reboot is
probably quicker than getting everything going again by hand.  (If it is
the VAXstation 4000-90A which needs to be rebooted, then I can dismount
and remount the member of DISK$USER on it from the ALPHA, so that I get
just a minicopy when the VAXstation 4000-90A comes back up.)

Note that every time this has happened, DISK$USER was in the shadow-copy
state, copying from the member on the VAXstation 4000-90A to the member
on the VAX 4000-105A---even if DISK$USER as a shadow set isn't
accessible to the VAXstation 4000-90A and its members show up only as
remote shadow members.

I doubt it is possible to avoid these problems without creating more as
long as the spontaneous reboots are happening.  However, I want to get
rid of the spontaneous reboots.  ANALYZE/CRASH says:

      OpenVMS (TM) VAX System dump analyzer

   Dump taken on  1-JAN-2009 06:04:26.14
   SHADDETINCON, SHADOWING detects inconsistent state

HELP/MESSAGE says:

 SHADDETINCON,  SHADOWING detects inconsistent state

  Facility:     BUGCHECK, System Bugcheck

  Explanation:  The volume shadowing software reached an irrecoverable or
                inconsistent state because a shadow set failed an internal
                consistency check.

  User Action:  Note the conditions leading to the error and contact a Compaq
                support representative. If the system is configured to produce
                a memory dump, retain the dump file.

I don't see how I can "Note the conditions leading to the error".

Since the hardware setup hasn't changed in years, and since I'm not
seeing any additional errors, my assumption is that the VAX 4000-105A
is acting up.  Fortunately, I have an identical spare (thanks Hans!), so
I plan to swap the machines today.  If the problem goes away, then
presumably there was a fault with the machine, but who knows what it
could be.

Actually, I can't swap out everything since I put all the memory for the
VAX 4000-105A I have (128 MB) in the one currently in the cluster, so I
will remove it and put it in the spare.  I don't think this is a problem
with the memory.

Any further suggestions?

Actually I don't think power is an issue. Phillip lives in Germany and
mains power is rather reliable.

My suggestion would be to have a good look at the network. The systems
mentioned are all 10 Mb/s systems, so they're connected with thinwire
coax or have UTP transceivers on their AUI ports. (Unless Phillip runs
a 10BASE5 "classic" Ethernet in his house, which I doubt :-)

Thinwire cables may go bad, and transceivers may fail too. As I read
the original post, the number of errors seems to be increasing over
time, which suggests failing hardware.

Look at the error counters of other protocols, such as LAT, DECnet or
IP. LAT especially, since it is very sensitive to problems in the
physical layer and may provide you with clues.
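
A rough sketch of where to look, assuming DECnet Phase IV and LAT are
both running on the node. The device name EZA0 is only a guess for the
VAX 4000-105A; check the real name with SHOW DEVICE first. Which IP
stack is installed isn't stated, so no IP command is shown.

```dcl
$! Assumption: EZA0 is the Ethernet device on this node (hypothetical;
$! confirm with SHOW DEVICE). The /FULL display includes the error count.
$ SHOW DEVICE/FULL EZA0:
$!
$! DECnet Phase IV line counters (send/receive failures, collisions)
$ MCR NCP SHOW KNOWN LINES COUNTERS
$!
$! LAT data-link counters
$ MCR LATCP SHOW LINK/COUNTERS
```

Comparing the same counters on all three nodes a few hours apart should
show whether one cable segment or transceiver stands out.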
Happy New Year!
Hans