Re: SEPPUCLU bugcheck introducing new cluster node



Tom,

let's first rule out the suspicion regarding LOCKIDTBL being too small.
There was a known problem in V7.2-1 causing a LOCKMGRERR crash at
LCK$ALLOC_PAGE_C+2B8 - this was solved by VMS721_FIBRE_SCSI-V0200 and
also in V7.3 or higher, so it does not apply to your crashes.

MONI LOCK on your nodes will show the no. of locks on each of them. If
you want to prevent LOCKIDTBL from expanding, you need to set it higher
than the max. no. of locks seen. @AUTOGEN SAVPARAMS SETPARAMS FEEDBACK
will do this for you. It will also take care of RESHASHTBL - check the
SYS$SYSTEM:AGEN$PARAMS.REPORT file.

I would have expected RSB$L_HASHVAL to be non-zero and have the same
value as LKMSG$L_HASHVAL - I have to think about how these values could
possibly be different.

The LOCKMGRERR on TROI (or one of the other nodes) happens because it
has received a 'bad LKMSG' from WORF. The 'bad' part in this message
was LKMSG$L_HASHVAL, because using this hash value did not find the
existing resource block. Is WORF booting from the same system disk, i.e.
does it have the same patch status? If there were something wrong
with the patches on your cluster, wouldn't you expect to see the same
type of crashes between the other nodes?

Did the new node work before? Can you boot it standalone and run some
tests involving the LAN interface (DTSEND, NCP/NCL LOOP tests)?

What we can try to do is to manually repeat the steps (in the dump)
that OpenVMS takes to locate the resource after receiving the
LKMSG$K_RMVDIR message.

It uses the HASH value (after shifting) to locate the hash chain in the
resource hash table. It then walks that chain to find the resource:

LKMSG$L_HASHVAL = 4F02A47B

SDA> eva (4F02A47B@-(^d32-@lck$gl_htblcnt))*8
Hex = 00000000.0000xxxx
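
In case the SDA expression looks cryptic: binary @ is a shift (a negative
count shifts right) and unary @ fetches the contents of a location. Here is
the same arithmetic as a rough Python sketch. It assumes lck$gl_htblcnt
holds the number of hash bits and that each hash table entry is an 8-byte
quadword; both are read off the expression above, not checked against the
listings. The function name and the value 16 are made up for illustration,
use the value from your dump.

```python
def hash_chain_offset(hashval, htblcnt, entry_size=8):
    """Byte offset of the hash chain header inside the resource hash table.

    hashval    -- 32-bit hash value, e.g. LKMSG$L_HASHVAL
    htblcnt    -- contents of lck$gl_htblcnt (assumed: number of hash bits)
    entry_size -- bytes per table entry (assumed: one quadword)
    """
    shift = 32 - htblcnt                      # keep only the top htblcnt bits
    index = (hashval & 0xFFFFFFFF) >> shift   # SDA: hashval@-(^d32-htblcnt)
    return index * entry_size                 # SDA: ...*8


# 16 is a hypothetical htblcnt, not a value taken from the dump.
print(hex(hash_chain_offset(0x4F02A47B, 16)))
```

The result corresponds to the xxxx offset that is added to the contents of
lck$gq_hashtbl in the next step.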

SDA> exa @^qlck$gq_hashtbl+xxxx
FFFFFFFF.yyyyyyyy: FFFFFFFF.7EDB75C0

SDA> vali que/sin/list/quad FFFFFFFF.yyyyyyyy

Entry    Address              Flink
-----    -------              -----
Header   FFFFFFFF.7F7B22B8    FFFFFFFF.7EDB75C0
1.       FFFFFFFF.zzzzzzzz    00000000.00000000

Queue is zero-terminated, total of 1 element in the queue

For each address shown in column 2 (excluding the Header line), do

SDA> SHOW RES/ADDR=FFFFFFFF.zzzzzzzz
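
What the VALIDATE QUEUE and SHOW RESOURCE steps do by hand is, roughly, a
walk along a zero-terminated singly linked chain. The toy model below only
illustrates that idea; it is not the lock manager's code. The memory
dictionary, the addresses and find_resource are all hypothetical, and it
matches on the stored hash value only (the RSB$L_HASHVAL field in the real
structure), without modelling whatever else the real lookup compares.

```python
def find_resource(memory, first_entry, hashval):
    """Walk a zero-terminated chain; return the matching entry address or None.

    memory      -- toy model: {address: (flink, stored_hashval)}
    first_entry -- Flink taken from the hash chain header
    hashval     -- hash value from the incoming message
    """
    seen = set()
    addr = first_entry
    while addr != 0:                  # a Flink of 0 terminates the queue
        if addr in seen:              # guard against a corrupted, looping chain
            raise ValueError("chain loops at %#x" % addr)
        seen.add(addr)
        flink, stored_hash = memory[addr]
        if stored_hash == hashval:
            return addr
        addr = flink
    return None                       # nothing found on this chain


# Hypothetical two-entry chain; addresses and hashes are invented.
toy_memory = {
    0x1000: (0x2000, 0x11111111),
    0x2000: (0x0000, 0x4F02A47B),
}
print(find_resource(toy_memory, 0x1000, 0x4F02A47B))
```

A lookup that returns nothing here corresponds to the situation in the
crash: the hash value in the message led to a chain that does not contain
the existing resource.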

I have tested this on V8.2, but it should also work on V7.3-1.

Volker.
