Re: 5.0.6 grinds to a complete halt
From: Barry Swane (bswane_at_rogers.com)
Date: 05/20/04
- Next message: IanC: "Re: Hard Disk Performance"
- Previous message: Tony Lawrence: "Re: Self defense for SCO users"
- In reply to: Bela Lubkin: "Re: 5.0.6 grinds to a complete halt"
- Next in thread: Bela Lubkin: "sniffing out a hung disk subsystem, Re: 5.0.6 grinds to a complete halt"
- Reply: Bela Lubkin: "sniffing out a hung disk subsystem, Re: 5.0.6 grinds to a complete halt"
- Reply: Bob Bailin: "Re: 5.0.6 grinds to a complete halt"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 20 May 2004 13:26:03 GMT
Bela, thanks so much for your clear and concise explanation of how a
disk-subsystem hang would be reflected in how OSR 5.0.6 behaved. I thought I
was going (gone?) nuts when it would echo characters but do nothing. (I
actually left my commands "running" for 8 hours at one point, figuring that
SOMETHING would finally happen-- not)
Also, thanks for all the additional info you have given on the use of the
debugger.
Now-- at the risk of wearing out my welcome here-- can you give me any
direction as to what I might be able to determine, with the debugger, to
confirm the theory that it is the disk subsystem that is hanging? I would
have thought there would have been some sort of error message, or timeout,
if it just hangs. Clearly that is not the case. Would I look for something
like a process that I can hopefully identify as disk IO, just sitting there
watching?
Thanks again
Barry
"Bela Lubkin" <belal@sco.com> wrote in message
news:20040520080340.GS10272@sco.com...
> Barry Swane wrote:
>
> > It appears I declared victory a little too early.
>
> > Killing the amirdmon process did indeed have salutary effects on the
> > performance. Customer stopped reporting noticeable slowness in system
> > performance.
> >
> > > One disconcerting fact: Before killing the amirdmon, I ran the same
> > > job on the new server, and on the 5 year old Acer Altos 9100 (also
> > > with RAID 5) server that it replaced. It took 6 times as long on the
> > > new server! After killing the amirdmon, I ran the job again-- now it
> > > only takes 4 times as long as the old server. Clearly something else
> > > is still not correct.
> >
> > As noted above, file copy type jobs were still 3-4 times slower than
> > the 5 year old server. However, the server did run for a full week,
> > before back-sliding yesterday. Again, nothing going on that I can pin
> > it on.
> >
> > > To answer your questions:
> > > System is remote, so I'm not able to observe disk light.
> > > I could in fact ping the system, while it was hung
> > > Flipping screens on the console did work-- sort of.
> > > i.e., user sees the login prompt, he can type his login,
> > > and it echoes the characters that are typed.
> > > But, then wait forever for password prompt-- never happens.
> >
> > I'm now inclined to theorize that Bela's suggestion is correct-- that
> > the disk (RAID 5) has stopped responding completely. Would that be
> > consistent with the reported behavior? i.e., if you are in a shell,
> > you can type characters, and they echo, and you can do a carriage
> > return-- but nothing is ever executed?
>
> Perfectly consistent. OpenServer is very conservative about swapping;
> it never pushes process pages out to swap unless it's out of memory. On
> modern systems this generally means that swap is never touched. Thus,
> any active process resides entirely in memory. Also, the kernel itself
> is all hard-loaded in RAM -- none of it is pagable. If the disk
> subsystem hangs, the kernel continues to function. Each individual
> process continues to function until the first time it tries to access
> the disk.
>
> For instance, the program that provides the login prompt (`getty`, for
> console ttys) will continue to accept and echo characters. If you hit
> return on a name, it goes to exec `login`, which involves disk access,
> so you never get to the password prompt.
>
> If you're sitting at a shell prompt, you can type; you can run internal
> commands like "echo foo"; but any attempt to run a binary will hang.
> (Even if the binary is fully cached, its access time needs to be updated
> on disk.)
>
> > Also, this seems major-league weird-- that the system can perform
> > absolutely normally, all the time-- except once in a while it loses
> > contact with the disk?
>
> It isn't particularly weird. What you're describing is a fairly
> standard set of symptoms for a variety of conditions including SCSI bus
> timing, parity or signal integrity problems; internal errors in a disk
> drive; and so on. You might rightly expect a RAID controller to be a
> bit more thorough about error recovery, but apparently this particular
> one -- in this particular failure case, whatever it is -- isn't.
>
> You also mischaracterize the situation here. It _isn't_ performing
> absolutely normally. It's running 6 times slower than older and
> presumably much slower machines.
>
> But I bet the two symptoms are actually unrelated, and you have two
> separate problems to solve. (1) complex application jobs run much more
> slowly than expected; (2) the disk subsystem occasionally hangs.
>
> > > Thanks for the tip on the debugger. I am optimistic that getting rid
> > > of the amirdmon will avoid the hangup again-- if I'm wrong, I will
> > > post the results you suggested.
> >
> > Some questions re the debugger- which I have now configured.
> > If the disk has stopped-- am I likely to get anything back from the
> > debugger?
> > I assume this can only be run from the system console-- I can't do it
> > remotely?
> > I imagine that, in order to get info from the debugger, root must
> > already be logged in, and sitting at # prompt?
> > I am trying to experiment with the debugger in advance of the
> > freeze-up, to try to get a little bit familiar with it:
> > i) if I hold CTRL-ALT-D - it just logs me out, as if I had pressed
> > CTRL-D
> > ii) I can load scodb, from shell prompt
> > If I enter "stack" command, I get
> > When operating on /dev/mem, you cannot examine the stack of the
> > current process. The "stack" command must be used with the "-p"
> > argument.
> > If I enter "stack -p", I get the same message
> >
> > Can someone point me to documentation on scodb? man scodb makes
> > reference to the SCODB User's Guide. I thought I had a complete set
> > of manuals- but I don't have that one.
>
> These are good questions... I'll post a second reply as a separate
> subthread, because I'm going to include some research results that are
> worth archiving permanently under a sensible subject line.
>
> >Bela<
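
A minimal sketch of the "watch for stalled disk I/O" idea discussed above. It is not from the thread and not OpenServer-specific tooling: it is generic Python, and the names `probe_disk`, `timeout` and `io` are hypothetical. The assumption is that a hung disk subsystem shows up as a small synchronous write that never returns, so the write is done in a helper thread and the caller only waits a bounded time for it.

```python
import os
import tempfile
import threading
import time


def probe_disk(path, timeout=5.0, io=None):
    """Return True if a small synchronous write under `path` finishes in time.

    `io` is a hypothetical hook that replaces the real write (used for testing).
    """
    def default_io():
        fd, name = tempfile.mkstemp(dir=path)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # ask the OS to push the data down to the device
        finally:
            os.close(fd)
            os.unlink(name)

    done = threading.Event()
    result = {}

    def worker():
        try:
            (io or default_io)()
            result["ok"] = True
        except OSError:
            result["ok"] = False  # an I/O error also counts as a failed probe
        finally:
            done.set()

    # Daemon thread: if the write blocks forever, the caller is not held up.
    threading.Thread(target=worker, daemon=True).start()
    finished = done.wait(timeout)
    return finished and result.get("ok", False)


if __name__ == "__main__":
    while True:
        if not probe_disk("/tmp"):
            # Report to the terminal only; a local log file would hit the same disk.
            print("disk probe did not complete in time")
        time.sleep(60)
```

Two points follow from the explanation quoted above: the probe has to be already running and resident in memory before the hang, since starting a new binary afterwards would itself block; and it should report somewhere that does not touch the local disk.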