Re: SCO 5.0.7 MP5 network hung up
- From: "Steve M. Fabac, Jr." <smfabac@xxxxxxx>
- Date: Mon, 02 Mar 2009 01:44:26 -0600
Bela Lubkin wrote:
Steve Fabac wrote:
In any case, whether this cleans up the leak and whether it awakens theNo.
network are also two separate things. So take note of both: Does it
reduce STREAMS memory allocation?
Does it wake the net back up?
No.
Oh well, it was worth a shot.
I was on-site at the clients on 2/26 to reload 5.0.7 on the primary
machine. During netconfig, I selected the second NIC on the system
board to move the NIC to IRQ11 (previously both systems were installed
using the NIC that exists on the same IRQ (IRQ10) as the Adaptec
aacraid controller.
Long story short, I went around the rack to move the LAN
cable and mistakenly moved the cable on the production server
instead of the server I was reloading. I was immediately called
with the news that everyone's session was locked. I moved the LAN
cable back and people were able to log back in to the system.
However, within 10 minutes, all telnet sessions were terminated
and new telnet connections were being refused.
Feb 26 12:47:27 failover NOTICE: bcme0 (slot:0 port:1): Link is down
Feb 26 12:48:51 failover NOTICE: bcme0 (slot:0 port:1): Link is up (1000Mbps, Full Duplex)
I quickly typed in the above commands as testnet.sh and ran it to create
/tmp/before, during, and after.
So wait a sec -- in this posting you are now asking about a different
(but similar) behavior on the other host? That really serves to confuse
the heck out of the issue.
No, the cable I pulled was on the production system "failover." I was reloading 5.0.7 on
the idle system (host name "real"), the system that would be considered
the backup system while failover was hosting the floating IP and serving
as the production server.
Back in August I installed 5.0.7 on the primary server "real" when it was
received from the distributor. It was put into production and then a
week later the backup server was received (to be named "failover") and
I installed 5.0.7 on it. In the week the primary server was running, it
experienced several panic shutdowns that were diagnosed as caused by the
license update that the current version of 5.0.7 automatically installs
as part of the ISL. I was advised by the distributor's technical support
department that the correct sequence for the ISL is to remove the license
update, install SMP (defer license) then install MP5. Failover was installed
following these recommendations and never experienced the panic condition.
At that time we moved production from "real" to "failover" and have been
running on "failover" since then. All posts in this thread were about
the Jan 19 and February 20 lockup on "failover."
While connected to the "real" server on February 23 and reviewing logs
on both failover and real, I noticed that syslog on real was not logging
and had not logged events since it was last booted on January 20. As I
reviewed /usr/adm/syslog, I found multiple gaps in the syslog record.
The "..." below shows where I omitted contiguous records and the abrupt
changes in the dates below are as found in syslog. On Feb 23, I either
restarted syslogd or issued kill -1 to the syslogd pid and it shows the
syslogd restart at that time.
The firewall forwards remote ssh connections to the floating IP for the production
server. This caused my remote login to be handled by failover and I then
issued "telnet real" and logged in as root. When I noticed that syslog was
not showing any events since Jan 20, I restarted syslogd and then used "ssh
-l smf localhost" as I know that ssh is always logged to syslog. You
can see the evidence that syslogd was running after the restart as it shows
my subsequent ssh connection.
Even though I had re-loaded 5.0.7 on "real" after moving production to
failover in August, the problem with syslog was the reason I returned to the
clients' office on 2/26 to re-install 5.0.7 again on "real" and to move the
network from the NIC on IRQ10 to IRQ11 to get it off IRQ10 used by the Adaptec
2130S RAID controller.
Aug 26 12:19:16 real ntpd[410]: kern_enable is 1
Aug 26 12:19:17 real ntpd[410]: frequency initialized -11.456 from /etc/drift
Aug 26 17:19:17 real sendmail[445]: starting daemon (8.11.0-20030924): SMTP+q
Sep 2 11:21:25 real sshd[10864]: Accepted password for smf from 192.168.160.
Sep 7 20:02:07 real NOTICE: bcme0 (slot:0 port:1): Link is down
Sep 7 20:04:25 real NOTICE: bcme0 (slot:0 port:1): Link is up (1000Mbps, Ful
Sun Sep 7 20:31:53 CDT 2008 shutdown initiated
Sep 7 20:32:01 real ntpd[410]: ntpd exiting on signal 15
Sep 7 20:32:01 real sshd[404]: Received signal 15; terminating.
Sep 7 20:32:08 real noded[323]: Exiting status(5)
Sep 7 20:33:37 real syslogd: restart
....
Oct 9 13:44:02 real rsyncd[7093]: tmp/tm/REGCOMM/quarterly/
Oct 9 13:44:06 real rsyncd[7093]: tmp/tm/REGCOMM/yeartodate/
Oct 9 13:44:17 real rsyncd[7093]: sent 410904 bytes received 8577845 bytes
Dec 31 18:53:20 real ntpd[411]: time reset -0.995221 s
Dec 31 18:53:20 real ntpd[411]: synchronisation lost
Dec 31 19:13:49 real ntpd[411]: time reset 0.448608 s
Dec 31 19:13:49 real ntpd[411]: synchronisation lost
Dec 31 22:32:28 real ntpd[411]: time reset 0.150028 s
Dec 31 22:32:28 real ntpd[411]: synchronisation lost
Tue Jan 20 09:31:26 CST 2009 shutdown initiated
Jan 20 09:31:34 real ntpd[411]: ntpd exiting on signal 15
Jan 20 09:31:34 real sshd[405]: Received signal 15; terminating.
Jan 20 09:31:42 real noded[308]: Exiting status(5)
Jan 20 09:31:42 real rsyncd[6752]: rsync error: received SIGINT, SIGTERM, or
Jan 20 09:33:09 real syslogd: restart
Jan 20 09:33:09 real SCO OpenServer(TM) Release 5
Jan 20 09:33:39 real ntpd[441]: precision = 8 usec
Jan 20 09:33:39 real ntpd[441]: kern_enable is 1
Jan 20 09:33:39 real ntpd[441]: frequency initialized -12.419 from /etc/driftfile
Jan 20 15:33:39 real sendmail[468]: starting daemon (8.11.0-20030924): SMTP+queueing@01:00:00
Jan 20 15:33:41 real lmail[315]: Cannot open /usr/spool/mail/test1: Operation would block
Jan 20 15:33:41 real lmail[308]: Cannot open /usr/spool/mail/test1: Operation would block
Jan 20 15:33:41 real lmail[310]: Cannot open /usr/spool/mail/test1: Operation would block
Feb 23 09:00:04 real syslogd: re-initialise
Feb 23 09:00:04 real syslogd: restart
Feb 23 09:05:49 real sshd[2677]: Accepted password for smf from 127.0.0.1 port 1485 ssh2
$ ls -lt /tmp/before /tmp/during /tmp/after
-rw-rw-r-- 1 root sys 6900 Feb 26 13:01 /tmp/after
-rw-rw-r-- 1 root sys 6900 Feb 26 13:01 /tmp/during
-rw-rw-r-- 1 root sys 6900 Feb 26 13:00 /tmp/before
The following is the diff3 results of the before, during and after
log files (I replaced numbers that did not change in the counts
with "+" to make it easer to see the changes):
I actually felt that blotting out digits of single numbers made it
harder, not easier to compare. i.e. if it said "1234 ... 1239' that
would be easier to read than "+++4 ... +++9". Blotting out entire
entries that don't change seems sort of reasonable, but even that isn't
really necessary.
I had tried to make it easer to quickly find the relevant changes by masking
the numbers that did not change. True that your example above of 1234 1239
by itself on a line is easy to compare. but a page with columns of numbers
is a different case. It's much easer for me to identify where the changes are and
to compute the magnitude of the changes when you don't have to deal with the
other trees in the forest.
I appreciate your taking the time to look at this and I tried to make the
task of analyzing the numbers as easy as possible in a way that I had found
to be useful in the past.
Anyway, you could have replaced all of this with "none of the numbers
changed by any material amount".
That goes to experience: If I made that statement, right or wrong, I would
be pre judging what you might make of the data.
The net0 down/up sequence in testnet.sh did not correct the network lock up.
The system was still inaccessible via the network.
That suggests a bug in the Broadcom driver (bcme). Though it's hard to
say.
The following results from ndstat -l seems to show that the NIC is receiving
packets all the time the telnet connections were being refused. The change
from 6712 (before and during) to 14325 (after) seems to indicate that traffic
out of the NIC was enabled by ifconfig net0 down / up cycle.
FRAMES
Unicast Multicast Broadcast Error Octets Queue Length
---------- --------- --------- ------ ----------- ------------
1:61,62c
In: ++++++18 0 +++++05 0 ++++496164 0
Out: ++++++23 0 589 0 ++++06712 0
2:61,62c
In: ++++++18 0 +++++33 0 ++++499100 0
Out: ++++++23 0 589 0 ++++06712 0
3:61,62c
In: ++++++27 0 +++++56 0 ++++502462 0
Out: ++++++34 0 589 0 ++++14325 0
Is it accurate to say that the numbers in the netstat -m "total" column
increasing in both the during and after tests show that data is being
received by the NIC and being handed off to the next level?
I was able to get the network back up by executing "/etc/tcp stop"
followed by "/etc/tcp start."
That's an interesting observation. You can use it: /etc/tcp is just a
shell script. You can extract, either by reading it or by running it
like `(sh -x /etc/tcp stop; sh -x /etc/tcp start) > /tmp/tcprestart.sh
2>&1`. (the resulting file is not a valid shell script, it's something
you can use your wits to edit down to a valid shell script).
It is probably the case, with some minor exceptions, that whatever `tcp
stop` and `tcp start` do, they do in more or less oppsite orders, and
you could do only part of them in those same orders. That is, if "stop"
does:
kill `cat /etc/whateverd.pid`
kill `cat /etc/someotherd.pid`
and "start" does:
/etc/someotherd
/etc/whateverd
then you could try stopping and restarting just whateverd; then
whateverd and someotherd, etc. The real action will of course not just
be stopping and starting daemons. You should still be able to mentally
pair stop & start commands.
Some part of the stop will probably amount to "kill any of the following
daemons" out of a list of a dozen or so. To correctly simulate partial
shutdowns of that sort of thing, you'll have to edit the list to only
include the ones you're cycling.
The point of this would be, next time you have the host in a bad state,
take it partway down and partway up, see if that fixes it; repeat, each
time taking it one step farther down, until you determine which step
fixes it. Once you know that you've gained two further things: a point
of leverage for further debugging (maybe someone will know why that
would cause it); and possibly a workaround (maybe the problem subsystem
has a standalone "reset" command; or maybe it isn't even running for a
good reason and you can just eliminate it).
You will probably find that while this is a good theory, in practice you
can't get away with it, some partially shut down states probably won't
come back up cleanly without having gone all the way down. But try it
anyway, you may not have that problem.
Now I think we were originally talking about the backup server, not the
production server. Is this basically the same set of symptoms? If so,
you've achieved a powerful step: you now know how to provoke it on
demand, by removing and reattaching the net cable. So you can probably
intentionally trigger it on the backup server, go through these
debugging steps without disturbing the users.
That might be a useful exercise (attempting to provoke a lock up on the backup
server). Trying to isolate the magic solution in tcp start on the production
server while 30 users are trying to take orders and ship product will not
be well received.
At that point, users were able to log in again. However, within 3-5 minutes
I was told that they could not print. Lpstat -t showed that the
scheduler was not running. I fixed that by typing /usr/lib/lpsched and the
users were able to print.
lpsched is stopped/started by a different init script than TCP. Pulling
TCP down probably severed some connections it was using, causing it to
crash or shut down semi-gracefully; then nothing knew to start it back
up. So you would want to add to your procedure: `lpshut; tcp stop; tcp
start; lpstart` (if those are the right lp start/stop commands).
However, at the 14:00 midday backup the users were not warned to log off
before the copy, and the rsync copy from the application directories to
/util/backup did not occur. I executed "/etc/cron" to restart cron and
then crontab -e to moved the 14:00 backup to 14:10 (upcoming time) and
the backup started at 14:10 (cron running again).
I wouldn't expect tcp restart to kill cron. But perhaps it did; you
would have to experiment on a live system. If so, you would want to add
deliberate cron shutdown/restart to your repair procedure.
To be honest, I did not know to check on lpsched or cron when the lockup
cause by un-plugging and plugging the cable occurred. I only found that they
were affected when I was able to get users back on by the tcp start command.
Interesting to look back at the log for netstat -m: It appears to show that
cron is affected by the lockup conditions in the past:
Mon Jan 19 15:45:00 CST 2009 streams memory in use: 13359.90KB
Mon Jan 19 15:50:00 CST 2009 streams memory in use: 13359.07KB
Mon Jan 19 15:55:00 CST 2009 streams memory in use: 13357.46KB
Mon Jan 19 16:00:00 CST 2009 streams memory in use: 13357.68KB
Mon Jan 19 16:20:00 CST 2009 streams memory in use: 746.99KB
Mon Jan 19 16:25:00 CST 2009 streams memory in use: 1639.61KB
Mon Jan 19 16:30:00 CST 2009 streams memory in use: 1734.93KB
There are three missing samples 16:05, 16:10, and 16:15 with the system
rebooted at 16:16. So the un-provoked network lockup also killed cron
or at least prevented the job logging netstat -m from running.
Jan 19 14:59:07 failover ftpd[14901]: FTP LOGIN REFUSED (ftp not in /etc/passwd)
Jan 19 15:38:37 failover ftpd[18510]: FTP LOGIN REFUSED (ftp not in /etc/passwd)
Nulls deleted smf 1/19/09
Jan 19 16:16:20 failover syslogd: restart
Jan 19 16:16:20 failover SCO OpenServer(TM) Release 5
And looking at /usr/spool/lp/logs/requests:
grep "Jan 19" requests | less
= ReformPRNbak-99766, uid 213, gid 100, size 2393, Mon Jan 19 16:02:58 2009
= ReformPRNbak-99767, uid 213, gid 100, size 2391, Mon Jan 19 16:02:58 2009
= sharppurch-99765, uid 264, gid 100, size 77012, Mon Jan 19 16:02:59 2009
= ReformPRNbak-99768, uid 213, gid 100, size 2391, Mon Jan 19 16:02:58 2009
= ReformPRNbak-99769, uid 213, gid 100, size 2391, Mon Jan 19 16:02:58 2009
= ReformPRNbak-99770, uid 213, gid 100, size 2387, Mon Jan 19 16:02:59 2009
= lexrec-99775, uid 209, gid 100, size 5116, Mon Jan 19 16:18:07 2009
= lexrec-99776, uid 209, gid 100, size 23600, Mon Jan 19 16:18:08 2009
= ReformPRNbak-99777, uid 209, gid 100, size 29940, Mon Jan 19 16:18:11 2009
= sharp-99778, uid 267, gid 100, size 798, Mon Jan 19 16:18:14 2009
= ReformPRNbak-99780, uid 213, gid 100, size 2393, Mon Jan 19 16:19:04 2009
= ReformPRNbak-99781, uid 213, gid 100, size 2391, Mon Jan 19 16:19:04 2009
= ReformPRNbak-99782, uid 213, gid 100, size 2387, Mon Jan 19 16:19:04 2009
Shows jobs 99771 - 99774 lost during the time the network was locked up.
All printers are network printers using netcat or rlpr.
Also found this in /usr/spool/lp/temp (many more but I just picked the
Jan 19 to Jan 20 ones:
-rw------- 1 jennie lp 0 Jan 20 08:30 100082-0
-rw-rw---- 1 jennie lp 0 Jan 20 08:30 100082-1
-rw------- 1 team13 lp 0 Jan 20 06:04 99930-0
-rw------- 1 root lp 0 Jan 19 16:04 99772
-rw-rw---- 1 lp lp 109 Jan 19 16:04 99772-0
-rw------- 1 judy2 lp 159613 Jan 19 16:04 99772-1
-rw------- 1 craig2 lp 0 Jan 19 16:03 99774-0
-rw-rw---- 1 craig2 lp 0 Jan 19 16:03 99774-1
-rw------- 1 daves lp 0 Jan 19 15:55 99741-0
-rw-rw---- 1 daves lp 0 Jan 19 15:55 99741-1
-rw------- 1 jennie lp 0 Jan 19 15:36 99687-0
-rw-rw---- 1 jennie lp 0 Jan 19 15:36 99687-1
-rw-rw---- 1 team13 lp 10240 Jan 19 15:28 99444-1
-rw------- 1 team13 lp 0 Jan 19 14:20 99444-0
# ls -lt *-1 | wc -l
84
-rw-rw---- 1 team13 lp 40960 Feb 28 18:01 128592-1
-rw-rw---- 1 jennie lp 1024 Feb 26 14:41 126788-1
....
-rw-rw---- 1 jennie lp 0 Aug 20 2008 2771-1
-rw-rw---- 1 jennie lp 1024 Aug 19 2008 1799-1
Guess I need to talk to the client about their printing: 84 hung
jobs out of 126793 jobs from Aug 19 to Feb 28. Lpstat -t does not
show any of these jobs.
And then hopefully whittle the whole thing down to something simple.
I've not implemented your other suggestions to modify the data (and time)
"netstat -m memory in use" log is updated. I plan to watch the daily
"00:00:00" readings for the next week to see if the idle (backup) server
is still seeing the streams memory leakage with its NIC on IRQ11.
Is it just me or is it odd that pulling the NIC cable would result in
the network crashing a few minutes after the cable is plugged back in?
Driver bugs in the down/up sequences? Some medium-important daemon
crashing or getting into a weird state that took a few minuetes to
really start causing trouble?
Also, why the problem with lpsched and cron not running after the
/etc/tcp stop, /etc/tcp start cycle?
lpsched doesn't surprise me much, cron's more of a mystery.
Bela<
--
Steve Fabac
S.M. Fabac & Associates
816/765-1670
.
- References:
- Re: SCO 5.0.7 MP5 network hung up
- From: Steve M. Fabac, Jr.
- Re: SCO 5.0.7 MP5 network hung up
- From: Bela Lubkin
- Re: SCO 5.0.7 MP5 network hung up
- Prev by Date: Re: SCO 5.0.7 MP5 network hung up
- Next by Date: Re: auditmask gone - how to recreate or should i bother
- Previous by thread: Re: SCO 5.0.7 MP5 network hung up
- Next by thread: Re: auditmask gone - how to recreate or should i bother
- Index(es):
Relevant Pages
|
Loading