Re: SCO 5.0.7 MP5 network hung up
- From: Bela Lubkin <filbo@xxxxxxxxxx>
- Date: Sun, 1 Mar 2009 14:36:22 -0800
Steve Fabac wrote:
In any case, whether this cleans up the leak and whether it awakens the
network are also two separate things. So take note of both: Does it
reduce STREAMS memory allocation?
No.
Does it wake the net back up?
No.
Oh well, it was worth a shot.
I was on-site at the client's on 2/26 to reload 5.0.7 on the primary
machine. During netconfig, I selected the second NIC on the system
board to move the NIC to IRQ11 (previously both systems were installed
using the NIC that exists on the same IRQ (IRQ10) as the Adaptec
aacraid controller).
Long story short, I went around the rack to move the LAN
cable and mistakenly moved the cable on the production server
instead of the server I was reloading. I was immediately called
with the news that everyone's session was locked. I moved the LAN
cable back and people were able to log back in to the system.
However, within 10 minutes, all telnet sessions were terminated
and new telnet connections were being refused.
Feb 26 12:47:27 failover NOTICE: bcme0 (slot:0 port:1): Link is down
Feb 26 12:48:51 failover NOTICE: bcme0 (slot:0 port:1): Link is up (1000Mbps, Full Duplex)
I quickly typed in the above commands as testnet.sh and ran it to create
/tmp/before, during, and after.
So wait a sec -- in this posting you are now asking about a different
(but similar) behavior on the other host? That really serves to confuse
the heck out of the issue.
$ ls -lt /tmp/before /tmp/during /tmp/after
-rw-rw-r-- 1 root sys 6900 Feb 26 13:01 /tmp/after
-rw-rw-r-- 1 root sys 6900 Feb 26 13:01 /tmp/during
-rw-rw-r-- 1 root sys 6900 Feb 26 13:00 /tmp/before
The following is the diff3 results of the before, during and after
log files (I replaced numbers that did not change in the counts
with "+" to make it easier to see the changes):
I actually felt that blotting out digits of single numbers made it
harder, not easier, to compare. i.e. if it said "1234 ... 1239" that
would be easier to read than "+++4 ... +++9". Blotting out entire
entries that don't change seems sort of reasonable, but even that isn't
really necessary.
Anyway, you could have replaced all of this with "none of the numbers
changed by any material amount".
The net0 down/up sequence in testnet.sh did not correct the network lock up.
The system was still inaccessible via the network.
That suggests a bug in the Broadcom driver (bcme). Though it's hard to
say.
I was able to get the network back up by executing "/etc/tcp stop"
followed by "/etc/tcp start."
That's an interesting observation. You can use it: /etc/tcp is just a
shell script. You can extract the commands it runs, either by reading it
or by running it like `(sh -x /etc/tcp stop; sh -x /etc/tcp start) >
/tmp/tcprestart.sh 2>&1`. (The resulting file is not a valid shell script;
it's something you can use your wits to edit down to a valid shell script.)
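A rough sketch of a first pass at that editing, assuming the traced
commands carry the default "+ " prefix that `sh -x` prints. The function
name trace_to_commands is made up for illustration. The trace shows
commands after expansion, with quoting gone, so the result still needs
hand editing:

```shell
# Sketch: pull the traced command lines out of an `sh -x` capture.
# Assumes the default trace prefix ("+ ", or "++ " etc. for nested levels).
# Everything else in the capture is the commands' own output and is dropped.
# Caveat: an output line that happens to start with "+ " is kept by mistake.
trace_to_commands() {
    # $1 = file captured with:
    #   (sh -x /etc/tcp stop; sh -x /etc/tcp start) > file 2>&1
    sed -n 's/^++* //p' "$1"
}
```

Something like `trace_to_commands /tmp/tcprestart.sh > /tmp/tcprestart.cmds`
gives you a list to read through and pair up, not a script to run blindly.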
It is probably the case, with some minor exceptions, that whatever `tcp
stop` and `tcp start` do, they do in more or less opposite orders, and
you could do only part of them in those same orders. That is, if "stop"
does:
kill `cat /etc/whateverd.pid`
kill `cat /etc/someotherd.pid`
and "start" does:
/etc/someotherd
/etc/whateverd
then you could try stopping and restarting just whateverd; then
whateverd and someotherd, etc. The real action will of course not just
be stopping and starting daemons. You should still be able to mentally
pair stop & start commands.
Some part of the stop will probably amount to "kill any of the following
daemons" out of a list of a dozen or so. To correctly simulate partial
shutdowns of that sort of thing, you'll have to edit the list to only
include the ones you're cycling.
The point of this would be, next time you have the host in a bad state,
take it partway down and partway up, see if that fixes it; repeat, each
time taking it one step farther down, until you determine which step
fixes it. Once you know that you've gained two further things: a point
of leverage for further debugging (maybe someone will know why that
would cause it); and possibly a workaround (maybe the problem subsystem
has a standalone "reset" command; or maybe it isn't even running for a
good reason and you can just eliminate it).
You will probably find that while this is a good theory, in practice you
can't get away with it: some partially shut down states probably won't
come back up cleanly without having gone all the way down. But try it
anyway; you may not have that problem.
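Here is one way the partway-down, partway-up idea might look, as a sketch
only. The two step pairs are the placeholder whateverd/someotherd examples
from above, not what /etc/tcp really does; the names partial_cycle,
run_step, STOP_n/START_n and RUN are invented for this illustration. It
uses $(( )) arithmetic, so it assumes ksh or a POSIX sh rather than the
oldest Bourne shell. By default it only prints what it would do:

```shell
# Placeholder step pairs, numbered in the order "stop" performs them.
# Replace these with the real pairs worked out from /etc/tcp.
STOP_1='kill `cat /etc/whateverd.pid`';  START_1='/etc/whateverd'
STOP_2='kill `cat /etc/someotherd.pid`'; START_2='/etc/someotherd'

# Print one step, or execute it when RUN=yes.
run_step() {
    if [ "${RUN:-no}" = yes ]; then
        eval "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Take the stack down to the given depth and bring it back up:
# stop steps 1..depth in order, then start steps depth..1 in reverse.
partial_cycle() {
    depth=$1
    i=1
    while [ "$i" -le "$depth" ]; do
        eval "cmd=\$STOP_$i"
        run_step "$cmd"
        i=$((i + 1))
    done
    i=$depth
    while [ "$i" -ge 1 ]; do
        eval "cmd=\$START_$i"
        run_step "$cmd"
        i=$((i - 1))
    done
}
```

With the host in the bad state you would run `RUN=yes partial_cycle 1`,
test the network by hand, then `RUN=yes partial_cycle 2`, and so on. The
first depth that brings the network back points at the step that matters.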
Now I think we were originally talking about the backup server, not the
production server. Is this basically the same set of symptoms? If so,
you've achieved a powerful step: you now know how to provoke it on
demand, by removing and reattaching the net cable. So you can probably
intentionally trigger it on the backup server and go through these
debugging steps without disturbing the users.
At that point, users were able to log in again. However, within 3-5 minutes
I was told that they could not print. "lpstat -t" showed that the
scheduler was not running. I fixed that by typing /usr/lib/lpsched and the
users were able to print.
lpsched is stopped/started by a different init script than TCP. Pulling
TCP down probably severed some connections it was using, causing it to
crash or shut down semi-gracefully; then nothing knew to start it back
up. So you would want to add to your procedure: `lpshut; tcp stop; tcp
start; lpstart` (if those are the right lp start/stop commands).
However, at the 14:00 midday backup the users were not warned to log off
before the copy, and the rsync copy from the application directories to
/util/backup did not occur. I executed "/etc/cron" to restart cron and
then crontab -e to move the 14:00 backup to 14:10 (upcoming time) and
the backup started at 14:10 (cron running again).
I wouldn't expect tcp restart to kill cron. But perhaps it did; you
would have to experiment on a live system. If so, you would want to add
deliberate cron shutdown/restart to your repair procedure.
And then hopefully whittle the whole thing down to something simple.
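As a starting point before any whittling, the whole procedure could be
sketched like this. The paths are assumptions to check on the real
system: /usr/lib/lpsched and /etc/cron are the commands reported as
working above, while /usr/lib/lpshut is my guess at the matching lp
shutdown command. The names step, cron_running, repair_net and RUN are
invented here. It prints the commands unless RUN=yes is set:

```shell
# Assumed command locations; verify each one before relying on this.
LPSHUT=/usr/lib/lpshut     # assumed lp shutdown command
LPSCHED=/usr/lib/lpsched   # restarted printing in the report above
TCP=/etc/tcp
CRON=/etc/cron

# Print a command, or execute it when RUN=yes.
step() {
    if [ "${RUN:-no}" = yes ]; then
        "$@"
    else
        echo "$*"
    fi
}

# True if a process with "cron" in its name shows up in ps -e.
cron_running() {
    ps -e | grep cron > /dev/null
}

repair_net() {
    step "$LPSHUT"          # stop the print scheduler cleanly first
    step "$TCP" stop
    step "$TCP" start
    step "$LPSCHED"         # nothing else will restart it
    # Restart cron only if the tcp cycle actually took it down.
    cron_running || step "$CRON"
}
```

Running `repair_net` with no RUN set shows the sequence; `RUN=yes
repair_net` performs it. Whether cron really dies in the tcp cycle is
still something to confirm by experiment.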
I've not implemented your other suggestions to modify the date (and time)
at which the "netstat -m memory in use" log is updated. I plan to watch the daily
"00:00:00" readings for the next week to see if the idle (backup) server
is still seeing the streams memory leakage with its NIC on IRQ11.
Is it just me or is it odd that pulling the NIC cable would result in
the network crashing a few minutes after the cable is plugged back in?
Driver bugs in the down/up sequences? Some medium-important daemon
crashing or getting into a weird state that took a few minutes to
really start causing trouble?
Also, why the problem with lpsched and cron not running after the
/etc/tcp stop, /etc/tcp start cycle?
lpsched doesn't surprise me much, cron's more of a mystery.
>Bela<