Re: Loosing all LAT connections (More answered questions)
- From: "Richard B. Gilbert" <rgilbert88@xxxxxxxxxxx>
- Date: Sun, 19 Apr 2009 19:20:49 -0400
johnwallace4@xxxxxxxxxxx wrote:
On Apr 17, 4:09 pm, JCamCMKRNL <jcam90...@xxxxxxxxxxxxx> wrote:
First, thanks to all who have responded. Your information has been
very valuable.
So far, we have not had another occurrence of all LAT connections
dropping on one system, just the original three occurrences in
the past three weeks. The information in the LAT counters does seem to
indicate that the problem will occur again. It is just a matter of
when.
Several of you asked some more questions about this issue, so I have
gathered the questions and the answers below. I hope I have hit all of
your queries. In particular, I think the very last question here and
its answer is very important.
----------------
> what does the current output of this show?
> MCR NCP SHOW COUNTER KNOW CIRC
It is very clean:
Known Circuit Counters as of 17-APR-2009 06:30:10
Circuit = ISA-0
>65534 Seconds since last zeroed
0 Terminating packets received
0 Originating packets sent
0 Terminating congestion loss
0 Transit packets received
0 Transit packets sent
0 Transit congestion loss
0 Circuit down
0 Initialization failure
0 Adjacency down
0 Peak adjacencies
28945 Data blocks sent
1447250 Bytes sent
0 Data blocks received
0 Bytes received
0 Unrecognized frame destination
0 User buffer unavailable
> Make this MC NCP SHOW KNOWN LINE COUNTERS
This is clean except for some send failures/collisions:
Known Line Counters as of 17-APR-2009 06:31:45
Line = ISA-0
>65534 Seconds since last zeroed
1691897 Data blocks received
25491 Multicast blocks received
0 Receive failure
78211496 Bytes received
1529460 Multicast bytes received
0 Data overrun
2240057 Data blocks sent
37989 Multicast blocks sent
87 Blocks sent, multiple collisions
102 Blocks sent, single collision
1173 Blocks sent, initially deferred
107990422 Bytes sent
1729968 Multicast bytes sent
8030 Send failure, including:
Carrier check failed
8030 Collision detect check failure
0 Unrecognized frame destination
0 System buffer unavailable
0 User buffer unavailable
> Jeff, you write that "the counters show no errors, but this was interesting from the MCR LATCP SHOW LINK/COUNT ...etc"
> What counters show no errors?
Here is the complete output of the LAT LINK counters:
Link Name: LAT$LINK
Device Name: _EZA4:
Seconds Since Zeroed: 65535
Messages Received: 1693146
Multicast Msgs Received: 25517
Bytes Received: 78269314
Multicast Bytes Received: 1531020
System Buffer Unavailable: 0
Unrecognized Destination: 0
Messages Sent: 2241710
Multicast Msgs Sent: 38006
Bytes Sent: 108119723
Multicast Bytes Sent: 1730717
User Buffer Unavailable: 0
Data Overrun: 0
Receive Errors -
Block Check Error: No
Framing Error: No
Frame Too Long: No
Frame Status Error: No
Frame Length Error: No
Transmit Errors -
Excessive Collisions: No
Carrier Check Failure: Yes
Short Circuit: Yes
Open Circuit: Yes
Frame Too Long: Yes
Remote Failure To Defer: No
Transmit Underrun: Yes
Transmit Failure: No
CSMACD Specific Counters
------------------------
Transmit CDC Failure: 8030
Messages Transmitted -
Single Collision: 102
Multiple Collisions: 87
Initially Deferred: 1173
> The transceiver on the DELNI wasn't replaced recently, was it?
No. It is the original H4000 which was installed about 6 years ago.
> You wrote that there is no 10BASE2 gear available. Is it possible to
> use RJ45 transceivers (enable heartbeat), a low speed UTP switch or
> hub with an AUI port on it?
I do have a Black Box switch with one AUI port, and 8 10-BaseT RJ45
ports available, but it requires change control paperwork to connect
it to the DEC Network. I would like to avoid doing this.
> If it turns out that it is not hardware, is it possible that there is a
> PC (or other 100MB equipment) connected to the backbone somewhere?
No. At this time all equipment on the DEC Network is 100% Digital
Equipment (Not Compaq, not HP) hardware.
> If you can get the DECserver counters easily ...
> please post them; they may not add anything to the picture,
> but they might.
Here are the results from one of the many DECServers.
Node LIMS is the VAX, all the others are PDP-11s running RSX.
Local> SHOW NODE ALL COUNTERS
Node: ALICE
Seconds Since Zeroed: 1985926
Messages Received: 1262
Messages Transmitted: 1133
Slots Received: 638
Slots Transmitted: 864
Bytes Received: 17554
Bytes Transmitted: 768
Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 6
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0
Node: IRV70A
Seconds Since Zeroed: 2577531
Messages Received: 0
Messages Transmitted: 0
Slots Received: 0
Slots Transmitted: 0
Bytes Received: 0
Bytes Transmitted: 0
Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 0
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0
Node: LIMS
Seconds Since Zeroed: 2577490
Messages Received: 122179
Messages Transmitted: 88449
Slots Received: 76864
Slots Transmitted: 65984
Bytes Received: 6861709
Bytes Transmitted: 494411
Multiple Node Addresses: 0
Duplicates Received: 6
Messages Re-transmitted: 1
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0
Node: MINNIE
Seconds Since Zeroed: 2577505
Messages Received: 69814
Messages Transmitted: 66573
Slots Received: 13149
Slots Transmitted: 13931
Bytes Received: 779022
Bytes Transmitted: 15412
Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 0
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0
> You have spares for all the active network kit, right?
Yes. We are planning to start by swapping out the DELNI in the Data
Center to see if this helps.
> Same goes for the MicroVAX itself. Do you have a spare network card (DELQA) you could plug in?
Yes. If the problem rears its head again after swapping out the
DELNI then we will swap out the DELQA.
If it continues to fail after that, then we will swap out the H4000
transceiver.
> Has anybody installed any significant new electrical kit on the
> factory floor recently?
No. The last physical change to the network was performed 2 months
before the first occurrence of this VAX/LAT connection dropout problem,
and that change was just adding another DECServer 200 to another DELNI
in a remote IDF closet.
> How much is the disruption costing you? Enough to make it worthwhile upgrading the backbone to modern technology?
The VAX system is used exclusively for interactive sessions by many of
our Medical Labs for the entry of laboratory test results for QA/QC
records for the FDA. As long as the interruptions are not prolonged,
there are no serious impacts to production. The PDP-11s are
considerably involved in production. They must stay up for production
to continue but they do not use the network for production operations.
The network for the PDP-11s is simply for production monitors to see
what is going on and to pass data to the VAX. We can take the network
down for planned upgrades or changes for up to 2-3 hours without
significant impact to production.
> Forgot to say: wrt user sessions being dropped, you do know you
> can avoid that using VMS's "virtual terminal" feature, right? When the
> LAT session is dropped, the VMS session continues and can in principle
> be resumed from where it left off once the user reconnects their
> session.
I seem to remember this feature in VMS from a long time ago, but for
some reason when the LAT connections are dropped, all the interactive
processes are stopped and the users are logged out.
> When the LAT service of the VMS machine had disappeared from the
> DECservers, did you still try to connect to the service somehow (e.g.
> SET H /LAT <VAX>)? If you did, was it successful?
I just enabled outgoing LAT on the VAX, and I can do a SET HOST/LAT
and it works now, but the problem of all LAT connections dropping has
not happened again since I enabled outgoing LAT.
> You know what they say, put a monkey in front of a keyboard and
> eventually he'll come up with something intelligent.
We have had hundreds of Monkeys using this network for about 30 years.
So far no sign of intelligence.
==========================
Thanks again for all of your input.
Jeff Cameron
Thanks for the update.
Should we be pleased it hasn't happened again? Or did the fault
perhaps happen again anyway, just with less visible consequences?
The VAX line counters and the counters from the DECservers all see
carrier detect check failures; the DECservers even say open circuit
detected and short circuit detected. Something's broken or
misconfigured (well duh!) and because the interconnects are DELNIs and
coax there's no current way to localise the problem (no repeaters,
switches, or bridges dividing the network into separate fault
domains); wherever it is, it will likely be visible everywhere on that
LAN.
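For a rough sense of scale, the VAX line counters posted earlier in the thread show 8030 send failures against 2240057 data blocks sent. The short Python sketch below is only illustrative arithmetic on those two numbers; the function name is made up for the example, and whether failed sends are included in the "Data blocks sent" total is an unverified assumption.

```python
def failure_percent(failures, blocks_sent):
    """Send failures as a percentage of data blocks sent."""
    return 100.0 * failures / blocks_sent


# Values copied from the "Known Line Counters" output for line ISA-0.
SEND_FAILURES = 8030
DATA_BLOCKS_SENT = 2240057

if __name__ == "__main__":
    print(f"{failure_percent(SEND_FAILURES, DATA_BLOCKS_SENT):.2f}% of sends failed")
```

That works out to roughly 0.36% of transmissions, over a period of unknown length because the counters had already passed the 65534-second mark.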
One suggestion I would make is that you zero the DECnet counters
periodically so that you have at least some idea of how old the
numbers are. The architected way of doing this is the counter timer
associated with the object, e.g.
MC NCP SET LINE ISA-0 COUNTER TIMER 14400 ! reset counters every four hours, or whatever.
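To put numbers on "how old": the counter output posted earlier reads ">65534 Seconds since last zeroed", so all it says is that the counters are more than about 18 hours old, which could equally mean days or weeks. The Python sketch below just converts the two second counts seen in this thread into hours and minutes; the helper name is hypothetical.

```python
def as_hours_minutes(seconds):
    """Convert a second count to (whole hours, leftover whole minutes)."""
    hours, remainder = divmod(seconds, 3600)
    return hours, remainder // 60


COUNTER_TIMER_SECONDS = 14400   # value used in the NCP command above
SATURATED_AGE_SECONDS = 65534   # NCP printed ">65534" in the posted counters

if __name__ == "__main__":
    print("counter timer:", as_hours_minutes(COUNTER_TIMER_SECONDS))
    print("age display tops out near:", as_hours_minutes(SATURATED_AGE_SECONDS))
```

With a four-hour timer the age field never gets anywhere near that ceiling, so each logged snapshot covers a known interval.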
In order to make this useful you then also need to have DECnet logging
set up to record the counter values before they are zeroed. It can be
done reasonably simply, but right now I can't remember the details, so
if you haven't done that logging kind of thing before you may find a
simple DCL loop in a batch job easier to get working:
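$! Show the line counters, zero them, sleep, repeat. Run as a batch job
$! so that the SHOW output is captured in the job's log file.
$ loop:
$ MC NCP SHOW LINE ISA-0 COUNT
$ MC NCP ZERO LINE ISA-0 COUNT
$ wait 1:00:00 ! or whatever
$ goto loop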
If you're a stickler for tidiness you may want to subtract a couple of
seconds from your nice round interval so it doesn't drift later as the
days go by, you may want to make the batch job autorestart, etc. Usual
stuff.
Once you have this set up you can keep an eye on the counters and if
the relevant ones are non-zero you know you have had a hiccup, and you
know very roughly when it occurred. That also covers the case where the
problem sometimes occurs invisibly, without causing a complete collapse
of LAT sessions.
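If eyeballing the log gets tedious, the check can be automated. The sketch below is in Python, on the assumption that the batch log is copied to a machine where Python is available (nothing in this thread says it is). The function names and the WATCHED list are made up for illustration; the counter names are copied from the NCP line counter output posted earlier, and the code handles one snapshot of that output at a time.

```python
import re

# Error counters worth watching; names as they appear in the NCP output.
WATCHED = (
    "Send failure",
    "Receive failure",
    "Collision detect check failure",
    "Data overrun",
    "System buffer unavailable",
    "User buffer unavailable",
)

# A counter line is "<number> <name>", optionally with a leading ">"
# as in ">65534 Seconds since last zeroed".
COUNTER_LINE = re.compile(r"^\s*>?(\d+)\s+(.+?)\s*$")


def parse_counters(text):
    """Return {counter name: value} for every counter line in one snapshot."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            # "Send failure, including:" is stored as plain "Send failure".
            name = re.sub(r",\s*including:$", "", match.group(2))
            counters[name] = int(match.group(1))
    return counters


def nonzero_errors(text):
    """Return only the watched counters that are greater than zero."""
    counters = parse_counters(text)
    return {name: counters[name] for name in WATCHED if counters.get(name, 0) > 0}
```

Calling nonzero_errors() on one SHOW LINE snapshot returns an empty dict for a quiet interval; anything else is worth a look, and the "Counters as of" timestamp in the same snapshot says roughly when.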
There's probably a way of logging individual carrier check failure
events *when they happen* too (or at least close to when they happen),
rather than checking counters every few hours, if you don't mind a bit
of programming; whether it's worth going to the trouble of doing that
is for you to decide, there may be other better uses for your time.
In an ideal world you'd do something equivalent for the DECserver
counters, but since everything's currently on the same LAN segment
(electrically) it probably doesn't matter.
By the way, nice to hear that you wouldn't be able to touch the
network hardware without getting change control authorisation. You
might (or might not) be amazed how many people don't bother with that
kind of thing.
People who know, and who care about having a job next week, frequently do use change control! It really helps to plan a change before making it and to document exactly what you are going to do. A really good change control system will require you to plan how to back out your change in case it should have disastrous results. The larger your organization and the more different sites you have, the more important change control becomes.