Re: 5.0.7 machine locks up!

From: Bela Lubkin (belal_at_sco.com)
Date: 09/06/03


Date: Sat, 6 Sep 2003 09:41:40 GMT
To: scomsc@xenitec.ca

Brian Lavender wrote:

> >> I just put into production my new OSR 5.0.7 machine running an
> >> Orthodontist application, and I have experienced several times where
> >> the system became unresponsive and I had to power cycle it
> >> (Ouch). I tried a terminal login, a virtual console, and even a telnet
> >> login. All I had was one shell that responded, but when I tried to
> >> switch to the root user, it would become unresponsive. Where can I
> >> look to find out what caused this?

> Here's what the machine has attached to it.
>
> 2 telnet logins
> 2 serial based terminal logins
> 1 login is through the Digi Terminal Server
> 1 login is through the tty1a
> The console with three virtual consoles
>
> The telnet and serial based terminals became totally unresponsive. I
> had one console on tty02 in an existing shell where I could type
> $ w
> and it would respond. I could type a few other commands as well. On
> the other virtual consoles, if I logged out, I would get a login:
> prompt. I could type in the user name, but then I would get no
> password: prompt. I do believe that after waiting a long time, I was
> able to get a password prompt. Then I did get a # prompt. I typed
> # init 6
> but it wouldn't go into reboot. The result of the w command showed
> zero load. I did a power cycle on the box, and after rebooting, I
> checked syslog and messages. I couldn't see anything that pointed to
> the problem.

This could very well be a disk hang. Next time it happens, watch the
disk light very carefully while provoking as much response as you can
still get (e.g. running `w`, getting to login prompts on multiscreens,
typing a username, hitting return, getting no password prompt, etc.).
If there is no disk light activity (it remains stuck off or on), it's
probably a disk hang. Then, when rebooting, watch the light again just
to be sure that it works under normal conditions (it wouldn't do to
blame the disk when actually the cable to the LED wasn't connected...).

Then, check all disk cabling and related areas. If it's SCSI, go into
the HBA's BIOS setup and make sure it's configured sanely.

> The Digi Terminal Server doesn't connect via PCI or ISA. It sits on
> the network and uses a driver to make the serial ports look as if they
> are local.
>
> A friend suggested I look at ps and see if there is a process that has
> a blocked or waiting interrupt. He also suggested looking at lsof.

Something that causes all terminal processes on the system to become
unresponsive is unlikely to be caused by a process. But do look at `ps
-ef` if you can. Column 4 (labeled "C") is an indicator of recent CPU
usage. If you see one or more processes that show 80 (the max value),
they're probably spinning, consuming CPU. But normally that would only
slow down a system, not stop it.
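
As a rough sketch of that check: the filter below keeps the `ps -ef`
header line plus any process whose fourth field is at or above a
threshold. The helper name `busy_procs` and its optional threshold
argument are illustrative assumptions, not anything shipped with the
system, and the filter assumes "C" is the fourth whitespace-separated
field, as described above.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the
# header line plus every process whose C column (field 4) is at or
# above the given threshold (default 80, the max value noted above).
busy_procs() {
    awk -v min="${1:-80}" 'NR == 1 || ($4 + 0) >= min'
}

# Usage sketch (run it from the one shell that still responds):
#   ps -ef | busy_procs        # only processes pegged at 80
#   ps -ef | busy_procs 40     # anything with notable recent CPU use
```

If nothing but the header comes back while the system is hung, that is
one more hint that a runaway process is not the cause.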

> The one thing I do know is that one of the serial based terminals
> shows names of patients who are scheduled to arrive. The patients
> normally check themselves in. If the receptionist checks in a person
> instead, the program normally removes the patient, and
> updates the patient login screen. The tty for the patient checkin is
> writeable by other users. I am thinking that maybe there is some
> process that has or is waiting for something to come available, and it
> is causing the system to block. There don't seem to be any specific
> conditions that cause this lockup.

This doesn't sound like a probable cause. Though I must say I still
don't particularly understand the description.

> Any suggestions on how to troubleshoot this?

Look into the disk stuff. Observe the disk lights; also run the
commands I asked about earlier (dparam etc.).

Link scodb into the kernel, break into the kernel debugger when the
system hangs, and look around. If you set up a serial console (easy,
you've already got tty1a active), you can capture scodb sessions nicely.
To set up a temporary serial console, get to the boot prompt and type
"systty=1". The boot prompt will move to tty1a == COM1 (9600/8/N/1).
Have a PC waiting there with "capture" turned on. Control-X is the
"enter scodb" command on a serial console. Once the machine hangs, hit
^X on the serial console, then:

  scodb> stack
  scodb> ps()

There's much more you can do, but that's a decent start. One other
thing you can do is capture a system image:

  scodb> sysdump()
  scodb> reboot()

(the machine reboots as if you'd hit the power switch). Now go to
single-user mode and run:

  sysdump -i /dev/swap -fu -o - | bzip2 > /tmp/dump.bz2

Save that; I might want to look at it.

>Bela<


