Re: 5.0.6 grinds to a complete halt
From: Bela Lubkin (belal_at_sco.com)
Date: 05/20/04
- Next message: Bela Lubkin: "using scodb for various purposes, Re: 5.0.6 grinds to a complete halt"
- Previous message: Bela Lubkin: "Re: VMWare/ Mouse"
- In reply to: Barry Swane: "Re: 5.0.6 grinds to a complete halt"
- Next in thread: Barry Swane: "Re: 5.0.6 grinds to a complete halt"
- Reply: Barry Swane: "Re: 5.0.6 grinds to a complete halt"
- Reply: Barry Swane: "Re: 5.0.6 grinds to a complete halt"
Date: Thu, 20 May 2004 08:03:40 GMT
To: scomsc@xenitec.ca
Barry Swane wrote:
> It appears I declared victory a little too early.
> Killing the amirdmon process did indeed have salutary effects on the
> performance. Customer stopped reporting noticeable slowness in system
> performance.
>
> > One disconcerting fact: Before killing the amirdmon, I ran the same job on
> > the new server, and on the 5 year old Acer Altos 9100 (also with RAID 5)
> > server that it replaced. It took 6 times as long on the new server!
> > After killing the amirdmon, I ran the job again-- now it only takes 4 times
> > as long as the old server. Clearly something else is still not correct.
>
> As noted above, file copy type jobs were still 3-4 times slower than
> the 5 year old server. However, the server did run for a full week,
> before back-sliding yesterday. Again, nothing going on that I can pin
> it on.
>
> > To answer your questions:
> > System is remote, so I'm not able to observe disk light.
> > I could in fact ping the system, while it was hung
> > Flipping screens on the console did work-- sort of.
> > i.e., user sees the login prompt, he can type his login,
> > and it echoes the characters that are typed.
> > But, then wait forever for password prompt-- never happens.
>
> I'm now inclined to theorize that Bela's suggestion is correct-- that
> the disk (RAID 5) has stopped responding completely. Would that be
> consistent with the reported behavior? i.e., if you are in a shell,
> you can type characters, and they echo, and you can do a carriage
> return-- but nothing is ever executed?
Perfectly consistent. OpenServer is very conservative about swapping;
it never pushes process pages out to swap unless it's out of memory. On
modern systems this generally means that swap is never touched. Thus,
any active process resides entirely in memory. Also, the kernel itself
is all hard-loaded in RAM -- none of it is pageable. If the disk
subsystem hangs, the kernel continues to function. Each individual
process continues to function until the first time it tries to access
the disk.
For instance, the program that provides the login prompt (`getty`, for
console ttys) will continue to accept and echo characters. If you hit
return on a name, it goes to exec `login`, which involves disk access,
so you never get to the password prompt.
If you're sitting at a shell prompt, you can type; you can run internal
commands like "echo foo"; but any attempt to run a binary will hang.
(Even if the binary is fully cached, its access time needs to be updated
on disk.)
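The same behaviour suggests a way to detect the hang from a process that
is already running: attempt a small write that must reach the disk, and
see whether it comes back within a time limit. What follows is only a
minimal sketch in Python, assuming an interpreter is available on the
machine and that the monitor was started before the hang (so it is
already resident in memory). The function name `probe_disk` and the
probe file name are made up for illustration; nothing here is an
OpenServer or RAID vendor tool.

```python
import os
import tempfile
import threading


def probe_disk(directory, timeout=5.0):
    """Return True if a small write+fsync in `directory` completes
    within `timeout` seconds, False otherwise."""
    done = threading.Event()

    def worker():
        # Hypothetical probe file name; the pid keeps concurrent probes apart.
        path = os.path.join(directory, ".disk_probe.%d" % os.getpid())
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"probe\n")
            # fsync forces the data out to the device, so this call is
            # where the thread would block if the disk has stopped responding.
            os.fsync(fd)
        finally:
            os.close(fd)
            os.unlink(path)
        done.set()

    # Daemon thread: a worker stuck in the kernel will not stop the
    # interpreter from attempting to exit.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    # False means "no answer in time". That covers a hung disk, but also a
    # worker that failed outright (for example, a missing directory).
    return done.wait(timeout)


if __name__ == "__main__":
    print(probe_disk(tempfile.gettempdir()))
```

The disk access is pushed into a separate thread so that the caller
keeps running, much as `getty` keeps echoing characters: only the thread
that touches the disk gets stuck. A False result is a hint, not a
diagnosis. Reporting it would have to go somewhere that needs no disk
access, such as the console or a network socket.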
> Also, this seems major-league weird-- that the system can perform
> absolutely normally, all the time-- except once in a while it loses
> contact with the disk?
It isn't particularly weird. What you're describing is a fairly
standard set of symptoms for a variety of conditions including SCSI bus
timing, parity or signal integrity problems; internal errors in a disk
drive; and so on. You might rightly expect a RAID controller to be a
bit more thorough about error recovery, but apparently this particular
one -- in this particular failure case, whatever it is -- isn't.
You also mischaracterize the situation here. It _isn't_ performing
absolutely normally. It's running 6 times slower than older and
presumably much slower machines.
But I bet the two symptoms are actually unrelated, and you have two
separate problems to solve. (1) complex application jobs run much more
slowly than expected; (2) the disk subsystem occasionally hangs.
> > Thanks for the tip on the debugger. I am optimistic that getting rid of the
> > amirdmon will avoid the hangup again-- if I'm wrong, I will post the results
> > you suggested.
>
> Some questions re the debugger- which I have now configured.
> If the disk has stopped-- am I likely to get anything back from the
> debugger?
> I assume this can only be run from the system console-- I can't do it
> remotely?
> I imagine that, in order to get info from the debugger, root must
> already be logged in, and sitting at # prompt?
> I am trying to experiment with the debugger in advance of the
> freeze-up, to try to get a little bit familiar with it:
> i) if I hold CTRL-ALT-D - it just logs me out, as if I had pressed
> CTRL-D
> ii) I can load scodb, from shell prompt
> If I enter "stack" command, I get
> When operating on /dev/mem, you cannot examine the stack of the
> current process. The "stack" command must be used with the "-p"
> argument.
> If I enter "stack -p", I get the same message
>
> Can someone point me to documentation on scodb? man scodb makes
> reference to the SCODB User's Guide. I thought I had a complete set
> of manuals- but I don't have that one.
These are good questions... I'll post a second reply as a separate
subthread, because I'm going to include some research results that are
worth archiving permanently under a sensible subject line.
>Bela<