Re: Loosing all LAT connections (More answered questions)



johnwallace4@xxxxxxxxxxx wrote:
On Apr 17, 4:09 pm, JCamCMKRNL <jcam90...@xxxxxxxxxxxxx> wrote:
First, thanks to all who have responded. Your information has been
very valuable.
So far, we have not had another occurrence of all LAT connections
dropping on one system, just the original three occurrences in
the past three weeks. The LAT counters do seem to
indicate that the problem will occur again. It is just a matter of
when.

Several of you asked some more questions about this issue, so I have
gathered the questions and the answers below. I hope I have hit all of
your queries. In particular, I think the very last question here and
its answer are very important.
----------------> what does the current output of this show?

MCR NCP SHOW COUNTER KNOW CIRC

It is very clean:
Known Circuit Counters as of 17-APR-2009 06:30:10

Circuit = ISA-0

>65534 Seconds since last zeroed
0 Terminating packets received
0 Originating packets sent
0 Terminating congestion loss
0 Transit packets received
0 Transit packets sent
0 Transit congestion loss
0 Circuit down
0 Initialization failure
0 Adjacency down
0 Peak adjacencies
28945 Data blocks sent
1447250 Bytes sent
0 Data blocks received
0 Bytes received
0 Unrecognized frame destination
0 User buffer unavailable

Make this MC NCP SHOW KNOWN LINE COUNTERS
This is clean except some send failures/collisions:
Known Line Counters as of 17-APR-2009 06:31:45

Line = ISA-0

>65534 Seconds since last zeroed
1691897 Data blocks received
25491 Multicast blocks received
0 Receive failure
78211496 Bytes received
1529460 Multicast bytes received
0 Data overrun
2240057 Data blocks sent
37989 Multicast blocks sent
87 Blocks sent, multiple collisions
102 Blocks sent, single collision
1173 Blocks sent, initially deferred
107990422 Bytes sent
1729968 Multicast bytes sent
8030 Send failure, including:
Carrier check failed
8030 Collision detect check failure
0 Unrecognized frame destination
0 System buffer unavailable
0 User buffer unavailable

Jeff, you write that "the counters show no errors, but this was interesting from the MCR LATCP SHOW LINK/COUNT ...etc"
What counters show no errors?
Here is the complete output of the LAT LINK counters:
Link Name: LAT$LINK
Device Name: _EZA4:

Seconds Since Zeroed: 65535
Messages Received: 1693146
Multicast Msgs Received: 25517
Bytes Received: 78269314
Multicast Bytes Received: 1531020
System Buffer Unavailable: 0
Unrecognized Destination: 0

Messages Sent: 2241710
Multicast Msgs Sent: 38006
Bytes Sent: 108119723
Multicast Bytes Sent: 1730717
User Buffer Unavailable: 0
Data Overrun: 0

Receive Errors -
Block Check Error: No
Framing Error: No
Frame Too Long: No
Frame Status Error: No
Frame Length Error: No

Transmit Errors -
Excessive Collisions: No
Carrier Check Failure: Yes
Short Circuit: Yes
Open Circuit: Yes
Frame Too Long: Yes
Remote Failure To Defer: No
Transmit Underrun: Yes
Transmit Failure: No

CSMACD Specific Counters
------------------------

Transmit CDC Failure: 8030

Messages Transmitted -
Single Collision: 102
Multiple Collisions: 87
Initially Deferred: 1173

The transceiver on the DELNI wasn't replaced recently was it?
No. It is the original H4000 which was installed about 6 years ago.

You wrote that there is no 10BASE2 gear available. Is it possible to
use RJ45 transceivers (enable heartbeat), a low speed UTP switch or
hub with an AUI port on it?
I do have a Black Box switch with one AUI port, and 8 10-BaseT RJ45
ports available, but it requires change control paperwork to connect
it to the DEC Network. I would like to avoid doing this.

If it turns out that it is not hardware, is it possible that there is a
PC (or other 100MB equipment) connected to the backbone somewhere?
No. At this time all equipment on the DEC Network is 100% Digital
Equipment (Not Compaq, not HP) hardware.

If you can get the DECserver counters easily ...
please post them; they may not add anything to the picture,
but they might.
Here are the results from one of the many DECServers.
Node LIMS is the VAX, all the others are PDP-11s running RSX.

Local> SHOW NODE ALL COUNTERS

Node: ALICE
Seconds Since Zeroed: 1985926
Messages Received: 1262
Messages Transmitted: 1133
Slots Received: 638
Slots Transmitted: 864
Bytes Received: 17554
Bytes Transmitted: 768

Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 6
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0

Node: IRV70A
Seconds Since Zeroed: 2577531
Messages Received: 0
Messages Transmitted: 0
Slots Received: 0
Slots Transmitted: 0
Bytes Received: 0
Bytes Transmitted: 0

Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 0
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0

Node: LIMS
Seconds Since Zeroed: 2577490
Messages Received: 122179
Messages Transmitted: 88449
Slots Received: 76864
Slots Transmitted: 65984
Bytes Received: 6861709
Bytes Transmitted: 494411

Multiple Node Addresses: 0
Duplicates Received: 6
Messages Re-transmitted: 1
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0

Node: MINNIE
Seconds Since Zeroed: 2577505
Messages Received: 69814
Messages Transmitted: 66573
Slots Received: 13149
Slots Transmitted: 13931
Bytes Received: 779022
Bytes Transmitted: 15412

Multiple Node Addresses: 0
Duplicates Received: 0
Messages Re-transmitted: 0
Illegal Messages Received: 0
Illegal Slots Received: 0
Solicitations Accepted: 0
Solicitations Rejected: 0

You have spares for all the active network kit, right?
Yes. We are planning to start by swapping out the DELNI in the Data
Center to see if this helps.

Same goes for the MicroVAX itself. Do you have a spare network card (DELQA) you could plug in?
Yes. If the problem rears its head again after swapping out the
DELNI then we will swap out the DELQA.
If it continues to fail after that, then we will swap out the H4000
transceiver.

Has anybody installed any significant new electrical kit on the
factory floor recently?
No. The last physical change to the network was performed 2 months
before the first occurrence of this VAX/LAT connection dropout problem,
and that change was just adding another DECServer 200 to another DELNI
in a remote IDF closet.

How much is the disruption costing you? Enough to make it worthwhile upgrading the backbone to modern technology?
The VAX system is used exclusively for interactive sessions by many of
our Medical Labs for the entry of laboratory test results for QA/QC
records for the FDA. As long as the interruptions are not prolonged,
there are no serious impacts to production. The PDP-11s are
considerably involved in production. They must stay up for production
to continue but they do not use the network for production operations.
The network for the PDP-11s is simply for production monitors to see
what is going on and to pass data to the VAX. We can take the network
down for planned upgrades or changes for up to 2-3 hours without
significant impact to production.

Forgot to say: wrt user sessions being dropped, you do know you
can avoid that using VMS's "virtual terminal" feature, right? When the
LAT session is dropped, the VMS session continues and can in principle
be resumed from where it left off once the user reconnects their
session.
I seem to remember this feature in VMS from a long time ago, but for
some reason when the LAT connections are dropped, all the interactive
processes are stopped and the users are logged out.

When the LAT service of the VMS machine had disappeared from the
DECservers, did you still try to connect to the service somehow (e.g.
SET H /LAT <VAX>)? If you did, was it successful?
I just enabled outgoing LAT on the VAX, and I can do a SET HOST/LAT
and it works now, but the problem of all LAT connections dropping has
not happened again since I enabled outgoing LAT.

You know what they say, put a monkey in front of a keyboard and
eventually he'll come up with something intelligent.
We have had hundreds of Monkeys using this network for about 30 years.
So far no sign of intelligence.
==========================
Thanks again for all of your input.

Jeff Cameron

Thanks for the update.

Should we be pleased it hasn't happened again? Or did the fault
perhaps happen again anyway, just with less visible consequences?

The VAX line counters and the counters from the DECservers all see
carrier detect check failures; the DECservers even say open circuit
detected and short circuit detected. Something's broken or
misconfigured (well duh!) and because the interconnects are DELNIs and
coax there's no current way to localise the problem (no repeaters,
switches, or bridges dividing the network into separate fault
domains); wherever it is, it will likely be visible everywhere on that
LAN.

One suggestion I would make is that you zero the DECnet counters
periodically so that you have at least some idea of how old the
numbers are. The architected way of doing this is the counter timer
associated with the object, e.g. MC NCP SET LINE ISA-0 COUNTER TIMER
14400 ! reset counters every four hours, or whatever.

In order to make this useful you then also need to have DECnet logging
set up to record the counter values before they are zeroed. It can be
done reasonably simply, but right now I can't remember the details, so
if you haven't done that logging kind of thing before you may find a
simple DCL loop in a batch job easier to get working:

$ loop:
$ MC NCP SHOW LINE ISA-0 COUNT
$ MC NCP ZERO LINE ISA-0 COUNT
$ wait 1:00:00 ! or whatever
$ goto loop

If you're a stickler for tidiness you may want to subtract a couple of
seconds from your nice round interval so it doesn't drift later as the
days go by; you may also want to make the batch job autorestart, etc.
Usual stuff.
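
For what it's worth, here is the drift point as a rough Python sketch rather
than DCL. It uses a slightly different technique from subtracting a couple of
seconds: each cycle it works out how long remains until the next multiple of
the interval and sleeps for exactly that. The function names are made up for
illustration, and this is meant to show the arithmetic, not to replace the
batch job above.

```python
import time


def seconds_until_next_boundary(now, interval):
    """Seconds from `now` (epoch seconds) to the next multiple of `interval`."""
    return interval - (now % interval)


def run_every(interval, action, cycles):
    """Call `action` once per interval, `cycles` times, aligned to boundaries.

    Because the sleep is recomputed from the clock each time, the time spent
    inside `action` does not accumulate as drift.
    """
    for _ in range(cycles):
        action()
        time.sleep(seconds_until_next_boundary(time.time(), interval))


# Ten seconds past the hour with a one-hour interval leaves 3590 seconds.
print(seconds_until_next_boundary(5 * 3600 + 10, 3600))
```

In DCL the same effect would need the delta time for WAIT to be computed on
each pass instead of being a fixed constant.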

Once you have this set up you can keep an eye on the counters and if
the relevant ones are non-zero you know you have had a hiccup, and you
know very roughly when it occurred. That helps in case the problem
sometimes occurs invisibly, without causing a complete collapse of LAT
sessions.
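
If the batch log gets copied to some other box for checking, something along
these lines could pick out the non-zero error counters. This is only a
sketch: it assumes the log holds NCP line counter output in the format quoted
earlier in this thread, and the list of counter names treated as "errors" is
my own guess at what is relevant, not anything NCP defines.

```python
import re

# Counter names treated as error indicators. This list is an assumption
# based on the NCP line counter output quoted earlier in the thread.
ERROR_COUNTERS = (
    "Receive failure",
    "Data overrun",
    "Send failure",
    "Collision detect check failure",
    "Carrier check failed",
    "System buffer unavailable",
    "User buffer unavailable",
)

# Matches lines such as "8030 Send failure, including:" and
# ">65534 Seconds since last zeroed"; lines without a number are skipped.
COUNTER_LINE = re.compile(r"^\s*>?(\d+)\s+(.+?)\s*$")


def nonzero_errors(ncp_output):
    """Return {counter name: value} for error counters that are non-zero."""
    found = {}
    for line in ncp_output.splitlines():
        match = COUNTER_LINE.match(line)
        if not match:
            continue
        value = int(match.group(1))
        name = match.group(2).split(",")[0]  # drop a ", including:" suffix
        if value and name in ERROR_COUNTERS:
            found[name] = value
    return found


SAMPLE = """\
Line = ISA-0

>65534 Seconds since last zeroed
1691897 Data blocks received
0 Receive failure
2240057 Data blocks sent
8030 Send failure, including:
Carrier check failed
8030 Collision detect check failure
0 System buffer unavailable
"""

print(nonzero_errors(SAMPLE))
```

Run against each hourly snapshot, an empty result means a quiet hour; anything
else tells you which hour to look at.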

There's probably a way of logging individual carrier check failure
events *when they happen* too (or at least close to when they happen),
rather than checking counters every few hours, if you don't mind a bit
of programming; whether it's worth going to the trouble of doing that
is for you to decide, there may be other better uses for your time.

In an ideal world you'd do something equivalent for the DECserver
counters, but since everything's currently on the same LAN segment
(electrically) it probably doesn't matter.

By the way, nice to hear that you wouldn't be able to touch the
network hardware without getting change control authorisation. You
might (or might not) be amazed how many people don't bother with that
kind of thing.


People who know, and who care about having a job next week, frequently do use change control! It really helps to plan a change before making it and to document exactly what you are going to do. A really good change control system will require you to plan how to back out your change in case it should have disastrous results. The larger your organization and the more different sites you have, the more important change control becomes.

