Re: Abnormal process kill.

From: Chuck Dillon (cdillon_at_nimblegen.com)
Date: 05/09/03


Date: Fri, 09 May 2003 10:10:27 -0500

Neil wrote:
> Hi there,
>
> I thought I'd post this, as frankly, I'm stumped. I have no idea and no
> clue.
>
> One of my customers is running an HPUX system on 11.00 and has a problem
> in which processes die for no reason. The machine is also up to date
> with all HPUX patches.
>
> There are no core files, no log record and no specific user-id (i.e. the
> process termination appears to be random).
>
> Generally speaking, application processes do not simply die without
> reason. My customer thinks that this may be performance related, but as
> there are no logs or any other info, I'm doubtful.

Most likely they have a bug that the developers can't/won't recognize
so they're pointing at the system.

I've no experience with HP-UX but generally speaking core files can
also be suppressed if the shell's corefilesize or the application's
RLIMIT_CORE is set to zero. It could be that the developers did one of
these for normal operations.
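
To check whether core files are being suppressed this way, something like the following could be run in the same environment that launches the victim programs. This is only a sketch: it assumes Python is available on the machine, and the function names core_limit_status and enable_core_dumps are made up for illustration. It uses the standard resource module, which wraps getrlimit()/setrlimit().

```python
import resource


def core_limit_status():
    """Return (soft, hard, suppressed) for RLIMIT_CORE in this process.

    suppressed is True when the soft limit is zero, in which case no
    core file is written even if the process dies from SIGSEGV etc.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft, hard, soft == 0


def enable_core_dumps():
    """Raise the soft core limit as far as the hard limit allows.

    Child processes started afterwards inherit the new limit. If the
    hard limit itself is zero, an unprivileged process cannot raise it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

If suppressed comes back True in the launching environment, the absence of core files says nothing about whether the programs are crashing.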

If the victim programs are really being launched from scripts then it
should be straightforward to modify that/those script(s) to provide
more information. For example, running the programs through a system
call tracer. That will change the timing of things and might reduce
the frequency of occurrence. I'd limit the calls being traced to
things like signals and exit conditions.
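
Short of a full tracer, the launch script could at least record how each program ended. The sketch below is not a system call tracer; it only waits for the child and logs whether it exited normally or was killed by a signal, and which one. It assumes Python is available, and the names describe_exit and run_and_report are hypothetical.

```python
import signal
import subprocess
import sys
import time


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description.

    subprocess reports death by signal N as the negative value -N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % returncode


def run_and_report(argv, log=sys.stderr):
    """Run argv as a child process, wait for it, and log how it ended."""
    proc = subprocess.Popen(argv)
    returncode = proc.wait()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    log.write("%s pid=%d %s: %s\n"
              % (stamp, proc.pid, argv[0], describe_exit(returncode)))
    return returncode
```

A launch script would call run_and_report() with the real command line in place of starting the program directly. A log line reporting signal 9 would confirm the kill -9 theory, while SIGSEGV or SIGBUS would point back at the application.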

-- ced

>
> Questions and answers that I have already checked out are below:
>
> Q. Conditions such as memory depletion, where the process needs more
> memory, but memory reservation or swap-space is not available.
>
> A. This doesn't appear to be a problem, as the machine is not working at
> full capacity.
>
> Q. Violating per-process resource limit(s) like cpu-time limit.
>
> A. This may be, but this problem also occurred on very calm days.
>
> Q. Upon such condition, the operating system will signal this to the
> process, and the process may act on this thru use of its
> signal-handlers. In case that no signal-handler was setup by the
> process, default action will be taken.
>
> A. All the processes have signal-handler routines, which handle SIGTERM
> type signals. In any case, all the processes do normal termination by
> sending SIGTERM signals with kill <process_id>. (Authorized operators
> are handling the terminations on main programs with menu options. Main
> programs terminate their subprograms automatically.) For our problem
> case, they seem to be terminated as if by kill -9. The kill -9 command
> generates a SIGKILL signal, and as this signal is an "operating system
> level signal", it couldn't be handled in programs.
>
> Q. But in general, when running program/scripts from command line, the
> executing shell will receive a notification of a failed process.
>
> A. All of our application programs run via the shell scripts running
> inside the main programs in the background mode with nohup process_name
> &.
>
> During startup of our main programs, they push themselves into the
> background mode with setpgrp() and fork() commands, just after
> completing their initial controls.
>
> Q. Such "core" files may *not* be created if the process' current
> directory cannot be written, or if the application is running with
> set-uid/set-gid bits (and the real user is different from the file
> owner).
>
> A. set-gid {setpgrp()} is only used in our 4 broadcasting programs and
> these broadcasting programs do nothing during broadcast, rather,
> subprograms do all the job. Such a problem hasn't been encountered on
> these programs, yet. Also they have the necessary rights on "current
> directory".
>
> But still, the processes could be altered to handle some of above
> signals (SIGBUS, SIGSEGV, SIGXCPU). (Also it should be considered that
> there are about 60 processes subject to alteration).
>
> Q. which process(es) is/are using much CPU? And what is the relation of
> this with the unexpected termination of processes?
>
> A. If we could find a relation of this with unexpected termination, we
> should interfere in the problem with certain methods like separating the
> functions of the processes or utilizing the function.
>
> Q. under what user-id are (were) the affected processes running?
>
> A. Operators run the broadcasting programs with the aid of built-in
> menu.
>
> Q. is there any application log(s) that provides information on process
> termination?
>
> A. Majority of our programs record their stop time into their individual
> log files. But the programs subject to the process kill problem could
> not record their stop time into the log file; they just die first.
>
> Q. is there any "core" file generated (if no indication of core files:
> is there any "core" file anywhere on the system)?
>
> A. No, this has not been seen.
>
> Q. are there any messages anywhere when a process terminates
> unexpectedly?
>
> A. They generate no message while dying.
>
> Anyone seen this sort of problem before? Any ideas?
>
> TIA,
>
> Neil

-- 
Chuck Dillon
Senior Software Engineer
NimbleGen Systems Inc.

