Re: Is 5.0.7 ready for production?
From: Bela Lubkin (belal_at_sco.com)
Date: Sun, 7 Sep 2003 08:38:38 GMT
To: email@example.com
Mike Brown wrote:
> > > i386ld: Symbol pci-debug in
> > "pci_debug", actually. Which shows that you're typing this rather than
> > cut-and-paste or redirecting to a file, tsk.
> Actually worse than that, it was from memory.
Well in that case you did a pretty good job...
> > > I am testing Progress 9.1b on 5.0.7, with a ProLiant ML530G3. So far
> > > the system has locked up about 6 times with the database running, no
> > > panic dumps as the system is completely locked. The server runs fine
> > > for days without Progress running, locks up in a few hours after
> > > the database is started. No idea why yet.
> > Is this without database activity?? Doesn't the Compaq watchdog stuff
> > kick in and reboot the system? (Not that that would be much better, but
> > it's odd to actually _hang_ a system with a watchdog in it...)
> Spent six hours on that server last night and this morning. I linked in
> the scodb, but after a lockup there is no keyboard input accepted ( not
> even CAPS-LOCK or NUM-LOCK toggle the leds ). Yes, the Compaq watchdog
> does kick in and reboot the server if I wait.
I think there's a way to get the watchdog to trip into scodb rather than
reboot -- some sort of software setting in the "cpqw" driver or
something like that. If not, two other routes would be to install an
NMI card in the machine (or it might have that built in -- so that's two
questions to ask a knowledgeable Compaq/HP tech); or try a serial
console. Serial console is really easy. See a recent post from me on
"5.0.7 machine locks up!", where I show how to do a temporary
(single-session) serial console.
> After a reboot I can get
> Progress to start up, but if I wait a while the machine locks instantly
> when the database goes to start. With a bit more debugging I think
> there may be a relationship between a consumption of streams resources
> as reported in "netstat -m" under "streams memory in use" ( SMiU ) and
> the system locking up. With just a root login on tty01 running netstat
> and a graphic login on tty02 all is well. If I fire up mozilla, let
> it bring up the sco home page, then watch the SMiU it slowly
> climbs up. Looks like it will go from ~180k to 4MB in 11 minutes.
That's not exactly "slow" for a resource that is normally sized in the
neighborhood of 4MB.
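To put a number on that, here is a rough sketch of the arithmetic in Python (my choice of language, nothing OSR5-specific about it). It assumes "~180k" means 180 KB and that 4MB is 4096 KB; the function name is made up for illustration.

```python
def leak_rate_kb_per_min(start_kb, end_kb, minutes):
    """Average growth of 'streams memory in use', in KB per minute."""
    return (end_kb - start_kb) / minutes

# Figures from the report: ~180k to 4MB in 11 minutes.
# Assumption: 4MB is taken as 4096 KB.
rate = leak_rate_kb_per_min(180, 4096, 11)
print(f"about {rate:.0f} KB/min")  # prints "about 356 KB/min"
```

At that pace the whole pool is gone in roughly the time it takes to read a few web pages.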
Is this the Mozilla build that came with OSR507? I haven't heard
anything about it causing STREAMS leaks (and normally it would be
difficult for a user-level program to cause a STREAMS leak).
A leak usually shows up entirely in one particular STREAMS buffer size.
Do you see that -- exceptionally high usage of one size, normal use of
others? (I'm primarily interested in the "alloc" column.)
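As a sketch of what "exceptionally high usage of one size" might look like once the numbers are in hand, the Python below flags any size class whose "alloc" count dwarfs the median of the others. The sample figures are invented, the function name and the factor of 10 are arbitrary choices of mine, and the code deliberately does not try to parse real "netstat -m" output; you would copy the counts in by hand.

```python
from statistics import median

def suspicious_classes(alloc_by_size, factor=10):
    """Return buffer size classes whose alloc count dwarfs the median."""
    typical = median(alloc_by_size.values())
    return [size for size, alloc in sorted(alloc_by_size.items())
            if typical > 0 and alloc > factor * typical]

# Hypothetical sample: buffer size in bytes -> "alloc" count.
# These numbers are invented for illustration, not taken from a real system.
sample = {64: 40, 128: 25, 256: 30, 512: 9000, 1024: 12, 2048: 8}
print(suspicious_classes(sample))  # prints [512]
```

If one size stands out like that, it narrows down which kind of message is being leaked.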
> Starting Progress at that point, or even using Mozilla to go to more
> web pages instantly locks the system. I repeated the test 3 times.
> If I just bring up Progress ( no graphical login ) the consumption is much
> slower, maybe 2k per minute. The server had frozen up and the Compaq
> watchdog rebooted it during the week, after 35.5 hours. As far as I can
> tell there was very little database use, the Progress server was just brought
> up and left running.
> The HW is a ProLiant ML530, single 2.4GHz Xeon with hyperthreading off,
> 1536MB of RAM, and a 6400 RAID controller. The NIC is a Broadcom with
> driver version 6.0.129 embedded on the system board. I installed an
> Intel PRO100 card to replace the BCME, which I disabled in the BIOS
> and in netconfig. There was no change in symptoms, or speed of the
> consumption of SMiU.
> The SW is 5.0.7 with OSS656B and EFS5.60a ( which is current ).
> I updated the system to osr507mp1 and retested, but no change.
> I can ftp or rcp without any problem, copied 8GB to the machine without
> any apparent residual increase in SMiU, but after running mozilla for
> a few minutes the next attempt at copying data piped through a rcmd
> froze the system.
> The machine is not in production at all, it is just for compatibility
> testing at this point. Any ideas?
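A quick sanity check on the slow case, sketched in Python. It assumes the Progress-only run started near the same ~180k and hit the same 4MB (4096 KB) ceiling as the Mozilla test, neither of which was actually measured for that run; the function name is hypothetical.

```python
def hours_until_exhausted(start_kb, limit_kb, rate_kb_per_min):
    """Hours until usage reaches the limit at a constant leak rate."""
    return (limit_kb - start_kb) / rate_kb_per_min / 60.0

# "maybe 2k per minute" from the report, read as 2 KB/min.
hours = hours_until_exhausted(180, 4096, 2)
print(f"roughly {hours:.1f} hours")  # prints "roughly 32.6 hours"
```

That lands in the same neighborhood as the 35.5 hours before the watchdog reboot, which would be consistent with the same leak running at a lower rate, though "maybe 2k per minute" is only an estimate.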
We're looking at at least two distinct bugs here, probably three.
Neither Mozilla nor Progress should cause STREAMS leaks; at worst, they
might cause a burst of consumption at startup time, leveling off after
reaching a steady state. There must be a kernel bug which reacts with
some operation they're doing to cause the leak. Then, whatever it is,
it's probably a Mozilla bug that it does it so _vigorously_.
And finally, the system shouldn't lock up when it runs out of STREAMS.
Normally it would produce console warnings, and various things that use
STREAMS (mainly networking) would start failing. At a guess, some
driver has a continual requirement for STREAMS blocks, and mishandles an
error return very badly. Again guessing, from what you've said the
Compaq EFS is the only added kernel code, so it probably contains the
driver that converts a resource crisis into a hang.
You could experiment with linking out various Compaq EFS drivers. I
know some of them can be removed individually, some go in groups. (By
"remove" I mean "turn off in link kit", not actually removing the
software, unless it's too hard to figure out how to turn them off.)
Is any part of the EFS required for the machine to work? Probably the
RAID driver; anything else? Maybe the thing to do is retract all of the
EFS that you don't absolutely need, then see which parts still persist.
Do Mozilla and Progress still leak STREAMS? Does the system still hang
when hitting the high water mark?