Re: Dying processes (inetd, cron, syslogd, sshd)

From: Bela Lubkin (filbo_at_armory.com)
Date: 08/08/05


Date: 8 Aug 2005 06:10:34 -0400

keith@actual-systems.com wrote:

> Anyone have any ideas on this problem?

I posted on August 1st, but never saw it come back to me. This time I'm
Bcc'ing you so you'll see it even if USENET swallows it again...

keith@actual-systems.com wrote:

> We are having problems on various SMP machines (5.0.6a + rs506a
> installed) where at times of large load most of the running processes
> just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
> seems to occur at times of large stress to the disks, but we have never
> managed to put our fingers on exactly what is causing it. When it does
> happen not only does the inetd process die, but also cron and syslog
> which makes it very tricky for us to put anything in place to try and
> catch what is happening.
>
> We are able to ping the machine when it does happen and also login at
> the console and over a modem but not over a telnet or ssh connection.
>
> We have had an issue open with SCO before who advised us to install
> scodb and set it to trigger when the inetd process stops - and when it
> does to get a sysdump. We have tried this, but the sysdump created was
> too big for swap - do you know of any way from within scodb to reduce
> the size of the sysdump created?
>
> This machine (which has had the problem once a day for the last three days)
> is used as a backup server in our office. All that runs on it is two
> rsync's of our main machine - one for mail/uucp spools, and one for the
> main data. The problem has always occurred during these rsyncs, normally
> when transferring a large file.

scodb can't reduce the size of a crash dump, but you can force the dumps
to fit by limiting the amount of memory seen by the kernel. To do this,
append " mem=1m-100m" to DEFBOOTSTR in /etc/default/boot (substituting a
bit less than the actual size of your dump area in place of "100m").
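
As a rough sketch of the arithmetic (not an official tool), the helper
below builds the string to append. The function name `boot_mem_arg`, its
arguments, and the default 4 MB margin are all made up for illustration;
pick the margin to suit your own dump area.

```shell
# Build the "mem=1m-<N>m" string from the dump area size in megabytes.
# $1 = dump area size in MB, $2 = headroom in MB (hypothetical default: 4).
# The headroom is the "bit less than" mentioned above.
boot_mem_arg() {
    dump_mb=$1
    margin_mb=${2:-4}
    limit=`expr "$dump_mb" - "$margin_mb"`
    echo "mem=1m-${limit}m"
}

# Example: a 128 MB dump area with 8 MB of headroom.
boot_mem_arg 128 8

# If DEFBOOTSTR currently reads (assumed typical value)
#   DEFBOOTSTR=hd(40)unix
# the edited line in /etc/default/boot would become
#   DEFBOOTSTR=hd(40)unix mem=1m-120m
```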

The load you describe would probably run in 12MB of RAM, but don't limit
memory more than you have to. The problem might be memory size-related.
You want to keep as much as you can of the machine's normal memory size.

[new material begins]

> What would be the outcome if you had one process that kept on wanting
> more and more resource?

There are some problem scenarios like that. A common one is a process
spinning out of control, allocating more and more memory. It will
eventually use all available memory; its next allocation attempt will
fail, and in most cases it will then die. Unless you have changed the
defaults, such a process usually writes a core dump. On OSR5, during
the dumping of a process's core, the process continues to own all of its
memory until the dump is complete. This means that the machine remains
critically out of memory for a long time. The process may have grown
nearly as large as your combined RAM + swap. To dump it, not only does
the kernel have to write that much data, it also may have to page a
large portion of it in from swap. This can take many minutes with large
memory and a slow disk...
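
To put a rough number on "many minutes", here is a back-of-envelope
estimate. The function `dump_minutes` and both figures in the example are
invented for illustration, not measured on any real system, and the
estimate ignores the extra reads needed to page data back in from swap,
so treat it as a lower bound.

```shell
# Estimate how long writing a core dump takes, in whole minutes.
# $1 = process size in MB, $2 = sustained disk write rate in MB/s.
# Integer arithmetic only, so the result is rounded down.
dump_minutes() {
    size_mb=$1
    rate_mb_s=$2
    secs=`expr "$size_mb" / "$rate_mb_s"`
    expr "$secs" / 60
}

# Example: a 2048 MB process dumped to a disk sustaining 5 MB/s.
dump_minutes 2048 5
```

With those made-up figures the write alone takes about 6 minutes, during
which the machine stays out of memory.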

During that period, other processes that try to allocate memory will
usually fail. Their subsequent behavior depends on their error
handling. Some will dump core, some will exit gracefully, some may even
stay up. And some may get into weird catatonic states.

> Do other processes hold onto the resource they have or will they
> eventually get 'bullied' out of the resource they are using and
> essentially stop (which theoretically would give the results I am
> seeing.)

For memory, a "hog" process will cause others to get written out to
swap, but those processes still "own" their memory (it will get paged
back in if they need to access it). The troubles happen when a process
tries to allocate more memory while the system is strapped.

There are probably other resources where similar things could happen.

> Any ideas? Or does anyone have any idea how I would track down
> what is causing this to happen?

If you had a process spin out and dump, it would leave a huge core file
that you would be able to find. If a process spins out and dies
_without_ leaving a dump, a more subtle trace is left. Normally, OSR5
doesn't use any swap at all; `swap -l` will have identical values in the
"blocks" and "free" columns. ("Normal" modern systems have enough RAM
that they never need to invoke the tremendous performance loss of
swapping.) After such an incident, `swap -l` will show quite a bit of
swap in use. This represents pages that got pushed out, and whose
processes have never actually needed to access them since the incident.

What does `crash` "p" show in the "EVENT" column for the hung processes?

>Bela<


