Re: SEPPUCLU bugcheck introducing new cluster node
- From: "Volker Halle" <volker_halle@xxxxxxxxxxx>
- Date: 23 Aug 2006 10:39:23 -0700
Tom,
> SDA> exa @r5+cdrp$l_val13
> FFFFFFFF.81BBDC20: 00000000.00000001 "........"
> SDA>
> Looks suspiciously like SS$_Normal
No, it's not a system service status code. It is a sequential error number used to identify the individual LOCKMGRERR/SEPPUCLU error conditions.
A value of 1 indicates that the hash did not match. The remote system is trying to remove an existing root resource. That resource should exist locally, but when the local node looked it up using the resource name and the hash value from the received LKMSG$K_RMVDIR lock message, the resource could not be found in the local resource database.
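To illustrate why a matching name alone is not enough, here is a toy model of a hash-bucketed resource lookup. It is a sketch under stated assumptions, not the VMS lock manager. The hash function (zlib.crc32), the bucket count, and the names ResourceDb, handle_rmvdir and LOOKUP_FAILED are all hypothetical. The model also assumes the receiver searches only the bucket selected by the hash value carried in the message. Under that assumption, a wrong hash value makes an existing resource invisible, which yields error number 1.

```python
import zlib

LOOKUP_FAILED = 1  # hypothetical name for error number 1 ("hash did not match")

def resource_hash(name: bytes) -> int:
    # Hypothetical stand-in; the real lock manager hash is not shown in this thread.
    return zlib.crc32(name) & 0xFFFFFFFF

class ResourceDb:
    """Toy resource database: hash buckets of (hash, name) entries."""

    def __init__(self, nbuckets: int = 64):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, name: bytes) -> None:
        h = resource_hash(name)
        self.buckets[h % len(self.buckets)].append((h, name))

    def lookup(self, name: bytes, hashval: int):
        # Assumption: only the bucket chosen by the RECEIVED hash value is
        # searched, and both the stored hash and the name must match.
        bucket = self.buckets[hashval % len(self.buckets)]
        for h, n in bucket:
            if h == hashval and n == name:
                return n
        return None

def handle_rmvdir(db: ResourceDb, name: bytes, hashval: int) -> int:
    """Return 0 (toy convention for 'found') or LOOKUP_FAILED."""
    return 0 if db.lookup(name, hashval) is not None else LOOKUP_FAILED

db = ResourceDb()
name = b"F11B$aSYSDISK1    !V\x00\x00"
db.insert(name)
print(handle_rmvdir(db, name, resource_hash(name)))        # 0
print(handle_rmvdir(db, name, resource_hash(name) ^ 0x1))  # 1
```

Flipping a single bit of the hash is enough to hide a resource whose name is stored locally.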
Here are the relevant fields from the received LKMSG:
FFFFFFFF.81D2ABCC  LKMSG$B_RSNLEN       16000000  len of resnam = 22.
FFFFFFFF.81D2ABD0  LKMSG$T_RESNAM       42313146  F11B
FFFFFFFF.81D2ABD4  LKMSG$L_EPIDCVT      59536124  $aSY
FFFFFFFF.81D2ABD8  LKMSG$L_DLCKPRI_CVT  53494453  SDIS
FFFFFFFF.81D2ABDC                       2020314B  K1
FFFFFFFF.81D2ABE0                       56212020  FID: 22049
FFFFFFFF.81D2ABE4                       00000000  ....
FFFFFFFF.81D2AC20  LKMSG$L_HASHVAL      4F02A47B
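The ASCII column above is just the longwords read back in memory order. The small sketch below rebuilds the 22-byte resource name from the displayed values. It assumes little-endian byte order, which both Alpha and Integrity VMS systems use, and takes RSNLEN = 0x16 from the first line. It also recovers the file ID 0x5621 = 22049 from the high word of 56212020. The helper name decode_resnam is hypothetical.

```python
import struct

# Longwords as displayed in the SDA dump, starting at LKMSG$T_RESNAM.
RESNAM_LONGWORDS = [0x42313146, 0x59536124, 0x53494453,
                    0x2020314B, 0x56212020, 0x00000000]
RSNLEN = 0x16  # resource name length from the LKMSG$B_RSNLEN line (22.)

def decode_resnam(longwords, length):
    # Assumes little-endian longwords: SDA prints each value high-order byte
    # first, but memory holds the low-order byte first, so packing with "<I"
    # restores the in-memory byte sequence.
    raw = b"".join(struct.pack("<I", lw) for lw in longwords)
    return raw[:length]

resnam = decode_resnam(RESNAM_LONGWORDS, RSNLEN)
fid = RESNAM_LONGWORDS[4] >> 16  # high word of 56212020

print(resnam)         # b'F11B$aSYSDISK1    !V\x00\x00'
print(hex(fid), fid)  # 0x5621 22049
```

The printable prefix F11B$aSYSDISK1 is what the SEARCH command looks for, and 5621 is the value to match in the File ID field.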
Ideally, you would look for that message in the other node's SEPPUCLU crash dump to find out whether the complete message was received correctly from the remote node. As you cannot analyze that crash, we have to speculate...
Note that the LKMSG$L_HASHVAL field is the LAST longword in the received message! If you can find that resource in the local node's resource database, but the hash lookup did NOT find it, the hash value in the received lock message could have been clobbered while passing through the network. To check, list the local resources to a file and search it for the resource name:
SDA> SET OUTPUT file
SDA> SHOW RESOURCE/BRIEF
SDA> EXIT
$ SEA file F11B$aSYSDISK1
If you find resources with that name, use SDA> SHOW RES/ADDR=<rsb-address-in-column-1> to identify the correct one: look for '5621' in the File ID field. If that resource is present in the local in-memory resource database but the hash lookup routine could NOT find it, then it is very likely that the last bytes of the message got clobbered somewhere.
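The diagnostic reasoning can be sketched with a toy message. Everything here is hypothetical except the one property the argument relies on: the hash value is the final longword of the message. The layout (1-byte length, 31-byte name, 4-byte hash), toy_hash, build_msg, parse_msg and diagnose are illustrative assumptions, not the real LKMSG format or lock manager code. Damaging only the tail leaves the name intact. A name-only scan therefore still finds the resource while the hash lookup fails.

```python
import struct
import zlib

def toy_hash(name: bytes) -> int:
    # Hypothetical stand-in; NOT the OpenVMS lock manager's hash function.
    return zlib.crc32(name) & 0xFFFFFFFF

def build_msg(name: bytes) -> bytes:
    # Toy layout (assumption): 1-byte length, name padded to 31 bytes,
    # then the 4-byte hash value as the LAST longword of the message.
    return (bytes([len(name)]) + name.ljust(31, b"\x00")
            + struct.pack("<I", toy_hash(name)))

def parse_msg(msg: bytes):
    length = msg[0]
    name = msg[1:1 + length]
    (hashval,) = struct.unpack_from("<I", msg, len(msg) - 4)
    return name, hashval

def diagnose(local_names: set, msg: bytes) -> str:
    name, hashval = parse_msg(msg)
    # Hash lookup: both the received hash and the name must match.
    if any(toy_hash(n) == hashval and n == name for n in local_names):
        return "found by hash lookup"
    # Name-only scan, analogous to searching the SHOW RESOURCE listing.
    if name in local_names:
        return "name present but hash lookup failed: suspect clobbered tail"
    return "resource not present locally"

name = b"F11B$aSYSDISK1    !V\x00\x00"
msg = build_msg(name)
# Flip every bit of the final byte, i.e. damage only the hash longword.
damaged = msg[:-1] + bytes([msg[-1] ^ 0xFF])

print(diagnose({name}, msg))      # found by hash lookup
print(diagnose({name}, damaged))  # name present but hash lookup failed: suspect clobbered tail
```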
Believe it or not, this afternoon I had a call from another customer who had crashed 2 nodes in a cluster by booting a test system into the cluster after changing the network infrastructure (adding GBit network cards and GBit switches). They were also seeing LOCKMGRERR/SEPPUCLU pairs of crashes ;-)
So yes, it's true: "we engineers will have our fun"
Volker.