Re: 5.0.6 grinds to a complete halt

From: Barry Swane (bswane_at_rogers.com)
Date: 05/19/04

    Date: 19 May 2004 09:13:22 -0700

    "Barry Swane" <bswane@rogers.com> wrote in message news:<mKgoc.43839$n7P1.28035@twister01.bloor.is.net.cable.rogers.com>...

    It appears I declared victory a little too early.

    > Looks like I have gotten close to the problem here.
    > It is apparently the amirdmon program that is the cause of this mischief.
    > This is for the lsil MegaRAID 2-channel U320 SCSI RAID controller, with
    > 128 MB cache.
    > The amirdmon program is the "monitor" program, set up as /etc/amirdmon, to
    > watch for, and report hard drive failures.
    > I killed the program this morning, and performance immediately improved. I
    > called lsil support, and they have sent me amirdmon v1.05 (to replace the
    > v1.04 that was supplied with the RAID controller).
    > Too soon to tell if all my problems are history, but it does look better.

    Killing the amirdmon process did indeed have salutary effects on
    performance. The customer stopped reporting noticeable slowness in
    system performance.

    > One disconcerting fact: Before killing the amirdmon, I ran the same job on
    > the new server, and on the 5 year old Acer Altos 9100 (also with RAID 5)
    > server that it replaced. It took 6 times as long on the new server!
    > After killing the amirdmon, I ran the job again-- now it only takes 4 times
    > as long as the old server. Clearly something else is still not correct.

    As noted above, file copy type jobs were still 3-4 times slower than
    the 5 year old server. However, the server did run for a full week,
    before back-sliding yesterday. Again, nothing going on that I can pin
    it on.

    >
    > To answer your questions:
    > System is remote, so I'm not able to observe disk light.
    > I could in fact ping the system, while it was hung
    > Flipping screens on the console did work-- sort of.
    > i.e., user sees the login prompt, he can type his login,
    > and it echoes the characters that are typed.
    > But, then wait forever for password prompt-- never happens.

    I'm now inclined to theorize that Bela's suggestion is correct-- that
    the disk (RAID 5) has stopped responding completely. Would that be
    consistent with the reported behavior? i.e., if you are in a shell,
    you can type characters, and they echo, and you can do a carriage
    return-- but nothing is ever executed?
    Also, this seems major-league weird-- that the system can perform
    absolutely normally, all the time-- except once in a while it loses
    contact with the disk?
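
One way to pin down when the disk stopped responding is a heartbeat log: a small job appends a timestamp to a file and forces it to disk at a fixed interval, and after the reboot the last entry shows roughly when writes stopped completing. The following is only a minimal sketch under assumptions. Python is not something the thread mentions being on this server, and the names `write_heartbeat`, `last_heartbeat`, and the log path are hypothetical.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp line to the log and force it to disk.

    If the disk has stopped responding, this call would be expected to
    block, so no further lines appear after the hang.
    """
    stamp = time.time() if now is None else now
    with open(path, "a") as f:
        f.write(f"{stamp:.0f}\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def last_heartbeat(path):
    """Return the last numeric timestamp in the log, or None if there is none."""
    last = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.isdigit():
                last = int(line)
    return last
```

Calling `write_heartbeat` once a minute from a loop, then reading `last_heartbeat` after the hard reboot, would narrow the time of the hang to about a minute. That time could then be compared against cron schedules and user activity.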

    >
    > Thanks for the tip on the debugger. I am optimistic that getting rid of the
    > amirdmon will avoid the hangup again-- if I'm wrong, I will post the results
    > you suggested.

    Some questions re the debugger- which I have now configured.
    If the disk has stopped-- am I likely to get anything back from the
    debugger?
    I assume this can only be run from the system console-- I can't do it
    remotely?
    I imagine that, in order to get info from the debugger, root must
    already be logged in, and sitting at # prompt?
    I am trying to experiment with the debugger in advance of the
    freeze-up, to try to get a little bit familiar with it:
    i) If I hold CTRL-ALT-D, it just logs me out, as if I had pressed
       CTRL-D.
    ii) I can load scodb from the shell prompt.
        If I enter the "stack" command, I get:
            When operating on /dev/mem, you cannot examine the stack of the
            current process. The "stack" command must be used with the "-p"
            argument.
        If I enter "stack -p", I get the same message.

    Can someone point me to documentation on scodb? man scodb makes
    reference to the SCODB User's Guide. I thought I had a complete set
    of manuals- but I don't have that one.

    Thanks again for any suggestions
    Barry

    > Thanks for your help
    >
    > Barry
    >
    > "Bela Lubkin" <belal@sco.com> wrote in message
    > news:20040507002147.GG10272@sco.com...
    > > Barry Swane wrote:
    > >
    > > > Running sco ose 5.0.6 on Acer Altos G510
    > >
    > > Describe this machine?
    > >
    > > > System runs merrily for anywhere from 10-20 hours
    > > > Then, just stops
    > > > Can't log in
    > > > If a shell is running, it will still echo characters typed, but takes no
    > > > action
    > > > cron jobs don't execute
    > > > only solution is to hard reboot
    > > >
    > > > It seems to me it must be completely pre-occupied doing something,
    > > > such that it totally ignores everything else.
    > > > But, no common thread as to when it dies, or what is running at that
    > > > point.
    > > >
    > > > Suggestions for how I can attack/diagnose what is going on?
    > >
    > > Any disk activity? This sounds like a hard disk hang. Look for
    > > permanently-on hard disk light. Or permanently-off (but that's less
    > > instructive, you can't tell if the drive is hung or just not being asked
    > > to do anything).
    > >
    > > Can you ping the system from elsewhere?
    > >
    > > Can you flip multiscreens on the console?
    > >
    > > Turn on the kernel debugger ("Y" in /etc/conf/sdevice.d/scodb, relink,
    > > reboot). When ground to a halt, break into scodb (Ctrl-Alt-D on a text
    > > console screen). Give it the command "stack" to get an idea of what's
    > > going on. "q" to quit, then break in again, do the same thing -- see if
    > > it's always doing the same thing. Post one or more sample stack traces
    > > (one of each unique one, if there aren't too many).
    > >
    > > >Bela<

