Re: Sudden reboot
From: Methusela Oreiley (wilsonag_at_gte.net)
Date: 06/26/03
- Next message: Michael W. Folsom: "Patch problem (109147-24)"
- Previous message: Greg Andrews: "Re: Tcpwrappers issue on Solaris 8"
- In reply to: Tanya: "Sudden reboot"
- Next in thread: Mr. Johan Andersson: "Re: Sudden reboot"
- Reply: Mr. Johan Andersson: "Re: Sudden reboot"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Date: Thu, 26 Jun 2003 02:01:07 GMT
[[ This message was both posted and mailed: see
the "To," "Cc," and "Newsgroups" headers for details. ]]
There are two ways to reboot UNIX (in theory):
1 - Go to run level 6 (commanded by root or power monitor).
2 - Kernel executes panic() (which eventually ends in a reboot).
That is a very short list. The lack of a log entry excludes a root
command and the power monitor.
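
To make that elimination step concrete, here is a minimal sketch in
Python. It sorts log lines into the two reboot paths listed above. The
search patterns and function names are my own assumptions for
illustration; the actual wording in your logs depends on the UNIX
flavor and release.

```python
import re

# Hypothetical patterns: real log wording differs between systems.
PATTERNS = {
    "commanded": re.compile(r"\b(shutdown|reboot|halt|init 6)\b", re.IGNORECASE),
    "panic": re.compile(r"\bpanic\b", re.IGNORECASE),
}


def classify_log(lines):
    """Return {category: [matching lines]} for the two reboot paths."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def likely_cause(hits):
    """Apply the elimination logic: panic evidence wins, then a command."""
    if hits["panic"]:
        return "kernel panic"
    if hits["commanded"]:
        return "commanded reboot"
    # No log entry excludes a commanded reboot, which leaves a panic
    # (or a hardware reset) that never reached the log.
    return "unknown (no log evidence; suspect panic or hardware reset)"
```

It could be run against the log mentioned in the original question,
for example likely_cause(classify_log(open("/var/adm/messages"))).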
panic() is executed to "bail" when a "bad thing" happens. Bad things
usually occur in situations where software reaches an invalid state.
This is usually due to one of the following:
A - program error (buggy driver or function)
B - hardware error (broken or incompatible hardware)
This is also a very short list.
Intermittent problems (such as this one) can only be captured using
crash files.
When an application executes panic(), it issues an API call to the
kernel that immediately suspends the process (zombie). The kernel
records the application's memory into a crash file for debugging, and
then the process is harvested. That file can be used to debug the
application. When the kernel itself performs panic(), it becomes
unable to handle things like file systems, so crash files are recorded
by an entirely different means, but the usage is identical
(troubleshooting).
The panic() function in the kernel calls a dump routine that stores
kernel RAM in the swap partition. On the next boot (which happens
right away), the crash utility is run. If the crash utility finds a
kernel dump in the swap partition, it compresses it and saves it to a
file (probably in /var or /adm). This occurs before virtual memory is
started.
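
As a rough way to check whether a dump was actually saved after the
reboot, here is a minimal Python sketch. It assumes the Solaris
savecore convention of paired files named unix.N and vmcore.N in a
crash directory (typically something like /var/crash/<hostname>). Both
the naming and the path are assumptions to confirm on your own system,
and find_saved_dumps is a name I made up.

```python
import os
import re


def find_saved_dumps(crash_dir):
    """Return sorted dump numbers N that have both unix.N and vmcore.N."""
    kernels, cores = set(), set()
    for name in os.listdir(crash_dir):
        match = re.fullmatch(r"(unix|vmcore)\.(\d+)", name)
        if not match:
            continue  # ignore unrelated files in the directory
        number = int(match.group(2))
        if match.group(1) == "unix":
            kernels.add(number)
        else:
            cores.add(number)
    # A dump is only usable for debugging when both halves are present.
    return sorted(kernels & cores)
```

An empty result suggests that either no panic dump was written or the
system is not configured to save one.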
The kernel may need to be reconfigured or rebuilt to obtain a crash
file, and a local disk needs to be connected (kernel dump will probably
not work over NFS).
The debug utility is used to analyze the crash file (probably kdb). You
will need to specify the location of the kernel file and the location
of the crash file. There are four important functions required for
debugging (use "?" to show a list within the debugger).
Process list
Process selection
Stack walkback
Register dump
First, dump out the process list. Second, select each process (one at a
time) and display the stack and registers. You are looking for the
register set of a process that shows panic() in its stack. This might
be the "active" process (the last one running).
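
The search itself is simple enough to sketch in Python. This is a toy
model only, not a real debugger interface: the Proc class, its fields
and find_panicking are all invented for illustration, and in practice
you would read the stacks off the debugger output by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Toy stand-in for one entry in the debugger's process list."""
    pid: int
    name: str
    stack: list  # function names from the stack walkback, oldest call first
    registers: dict = field(default_factory=dict)


def find_panicking(procs):
    """Check every process and keep those with a panic frame in the stack."""
    return [p for p in procs if any("panic" in frame for frame in p.stack)]
```

Each process returned is a candidate whose register dump is worth
writing down before asking for further help.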
The chain of events that ended up at the panic() function may start
with an interrupt. Most kernel events would start with an API call from
an application, but an unimplemented interrupt or error interrupt could
have occurred. Interrupts will stop whatever function was active and
start a new function. Interrupt service functions probably have "intr"
in the function name (may be easy to recognize), and all interrupt
handling functions are in kernel memory. The interrupt may have been
triggered by an API call from a process that became active shortly
before panic(). In other words: the functions later in the stack may
not have been called by those listed earlier in the stack.
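
To illustrate that last point, here is a small Python sketch that
splits a stack walkback at the first interrupt frame. It relies on the
guess above that interrupt service functions have "intr" in their
names, which is a heuristic and not a rule. The function name and the
sample frame names in the comments are hypothetical.

```python
def split_at_interrupt(stack):
    """Split a walkback (function names, oldest call first) in two.

    Returns (interrupted, handler): the frames that were running when
    the interrupt arrived, and the frames from the interrupt handler
    onward. If no frame looks like an interrupt, handler is empty.
    """
    for index, frame in enumerate(stack):
        if "intr" in frame:  # heuristic: "intr" marks an interrupt routine
            return stack[:index], stack[index:]
    return stack, []


# Example with made-up frame names:
# split_at_interrupt(["syscall_entry", "read", "hypo_intr", "panic"])
# separates ["syscall_entry", "read"] from ["hypo_intr", "panic"].
```

The frames in the first part did not call the frames in the second
part, so the cause of the panic should be sought in the second part.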
Unfortunately I do not know enough about SPARC to help you further.
Once you get this far, you should be able to get further assistance to
isolate the cause.
You may get more information by typing "kernel+driver+development" into
the search menu at http://docs.sun.com.
Best of luck.
Greg Wilson
nanoatzin99@netscape.net
---------
In article <db4636cf.0306200450.e27c3d9@posting.google.com>, Tanya
<tanya.levitsky@getronics.com> wrote:
> I am monitoring Unix servers for a customer, and yesterday one of
> the servers rebooted suddenly without any errors or warnings. I checked
> /var/adm/messages and authlog; no users were logged on, and there is no
> evidence of any problems. The server is a Sun 3800, split into 2 domains
> (only one went down), running Solaris 8. Has something like that ever
> happened to anyone? Am I missing anything? I would really appreciate any
> input. Thanks.