Re: 5.0.6 grinds to a complete halt

From: Barry Swane (bswane_at_rogers.com)
Date: 05/20/04


Date: Thu, 20 May 2004 13:26:03 GMT

Bela, thanks so much for your clear and concise explanation of how a
disk-subsystem hang would show up in the way OpenServer 5.0.6 behaved. I
thought I was going (gone?) nuts when it would echo characters but do
nothing. (I actually left my commands "running" for 8 hours at one point,
figuring that SOMETHING would finally happen -- not.)
Also, thanks for all the additional info you have given on the use of the
debugger.
Now-- at the risk of wearing out my welcome here-- can you give me any
direction as to what I might be able to determine, with the debugger, to
confirm the theory that it is the disk subsystem that is hanging? I would
have thought there would have been some sort of error message, or timeout,
if it just hangs. Clearly that is not the case. Would I look for something
like a process that I can hopefully identify as disk IO, just sitting there
watching?
Thanks again
Barry

"Bela Lubkin" <belal@sco.com> wrote in message
news:20040520080340.GS10272@sco.com...
> Barry Swane wrote:
>
> > It appears I declared victory a little too early.
>
> > Killing the amirdmon process did indeed have salutary effects on the
> > performance. Customer stopped reporting noticeable slowness in system
> > performance.
> >
> > > One disconcerting fact: Before killing the amirdmon, I ran the same job on
> > > the new server, and on the 5 year old Acer Altos 9100 (also with RAID 5)
> > > server that it replaced. It took 6 times as long on the new server!
> > > After killing the amirdmon, I ran the job again-- now it only takes 4 times
> > > as long as the old server. Clearly something else is still not correct.
> >
> > As noted above, file copy type jobs were still 3-4 times slower than
> > the 5 year old server. However, the server did run for a full week,
> > before back-sliding yesterday. Again, nothing going on that I can pin
> > it on.
> >
> > > To answer your questions:
> > > System is remote, so I'm not able to observe disk light.
> > > I could in fact ping the system, while it was hung
> > > Flipping screens on the console did work-- sort of.
> > > i.e., user sees the login prompt, he can type his login,
> > > and it echoes the characters that are typed.
> > > But, then wait forever for password prompt-- never happens.
> >
> > I'm now inclined to theorize that Bela's suggestion is correct-- that
> > the disk (RAID 5) has stopped responding completely. Would that be
> > consistent with the reported behavior? i.e., if you are in a shell,
> > you can type characters, and they echo, and you can do a carriage
> > return-- but nothing is ever executed?
>
> Perfectly consistent. OpenServer is very conservative about swapping;
> it never pushes process pages out to swap unless it's out of memory. On
> modern systems this generally means that swap is never touched. Thus,
> any active process resides entirely in memory. Also, the kernel itself
> is all hard-loaded in RAM -- none of it is pagable. If the disk
> subsystem hangs, the kernel continues to function. Each individual
> process continues to function until the first time it tries to access
> the disk.
>
> For instance, the program that provides the login prompt (`getty`, for
> console ttys) will continue to accept and echo characters. If you hit
> return on a name, it goes to exec `login`, which involves disk access,
> so you never get to the password prompt.
>
> If you're sitting at a shell prompt, you can type; you can run internal
> commands like "echo foo"; but any attempt to run a binary will hang.
> (Even if the binary is fully cached, its access time needs to be updated
> on disk.)
>
> > Also, this seems major-league weird-- that the system can perform
> > absolutely normally, all the time-- except once in a while it loses
> > contact with the disk?
>
> It isn't particularly weird. What you're describing is a fairly
> standard set of symptoms for a variety of conditions including SCSI bus
> timing, parity or signal integrity problems; internal errors in a disk
> drive; and so on. You might rightly expect a RAID controller to be a
> bit more thorough about error recovery, but apparently this particular
> one -- in this particular failure case, whatever it is -- isn't.
>
> You also mischaracterize the situation here. It _isn't_ performing
> absolutely normally. It's running 6 times slower than older and
> presumably much slower machines.
>
> But I bet the two symptoms are actually unrelated, and you have two
> separate problems to solve. (1) complex application jobs run much more
> slowly than expected; (2) the disk subsystem occasionally hangs.
>
> > > Thanks for the tip on the debugger. I am optimistic that getting rid of the
> > > amirdmon will avoid the hangup again-- if I'm wrong, I will post the results
> > > you suggested.
> >
> > Some questions re the debugger- which I have now configured.
> > If the disk has stopped-- am I likely to get anything back from the
> > debugger?
> > I assume this can only be run from the system console-- I can't do it
> > remotely?
> > I imagine that, in order to get info from the debugger, root must
> > already be logged in, and sitting at # prompt?
> > I am trying to experiment with the debugger in advance of the
> > freeze-up, to try to get a little bit familiar with it:
> > i) if I hold CTRL-ALT-D - it just logs me out, as if I had pressed
> > CTRL-D
> > ii) I can load scodb, from shell prompt
> > If I enter "stack" command, I get
> > When operating on /dev/mem, you cannot examine the stack of the
> > current process. The "stack" command must be used with the "-p"
> > argument.
> > If I enter "stack -p", I get the same message
> >
> > Can someone point me to documentation on scodb? man scodb makes
> > reference to the SCODB User's Guide. I thought I had a complete set
> > of manuals- but I don't have that one.
>
> These are good questions... I'll post a second reply as a separate
> subthread, because I'm going to include some research results that are
> worth archiving permanently under a sensible subject line.
>
> >Bela<
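
Bela's description above (a process already resident in memory keeps running
until the first time it touches the disk) suggests one way to check whether the
disk subsystem is the part that has stopped. What follows is only a minimal
sketch, not something from the thread: it is written in Python for
illustration, the function name disk_responds and its parameters are
hypothetical, and it assumes an interpreter that was already running before the
hang, since starting any new binary would itself block. The parent forks a
child that does a small write plus fsync; the parent never touches the disk, so
it can still notice that the child has not returned.

```python
import os
import time


def disk_responds(path, timeout=10.0, poll=0.1):
    """Return True if a small write+fsync to `path` finishes within `timeout` seconds.

    Hypothetical helper for illustration; `path`, `timeout` and `poll` are assumptions.
    """
    pid = os.fork()
    if pid == 0:
        # Child: perform the disk access. If the disk subsystem is hung, this
        # blocks inside the kernel and the child never reaches os._exit().
        status = 1
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            os.write(fd, b"probe\n")
            os.fsync(fd)  # ask for the data to go to the device, not only the buffer cache
            os.close(fd)
            status = 0
        finally:
            os._exit(status)  # exit code 1 if any step above raised

    # Parent: does no disk I/O of its own, so it keeps running and can time out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(poll)
    # The child is still blocked. It is left behind on purpose: a process stuck
    # in disk I/O generally cannot be killed until the I/O completes.
    return False
```

Calling disk_responds("/tmp/disk_probe.tmp") from a loop in a long-running
process and printing the result to a terminal would give a rough timestamp for
when the disk stopped answering. The path is a placeholder. A False result only
says that the write did not complete in time (or failed outright); it does not
say why.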


