Solaris 2.8 & 2.9 kernel eating all my memory?

From: William Hathaway (wdh_at_perfectorder.com)
Date: 03/19/04


Date: 18 Mar 2004 15:54:19 -0800

I'm working with a set of 280Rs (2x750MHz or 2x900MHz, 2GB) that are
used as load generators (LG) for a TCP-based application written in C.
Previously the machines had been tweaked to allow up to approx 60k
simultaneous outbound connections each (lowering
tcp_smallest_anon_port, raising tcp_conn_hash_size and fd limits). The
load generation test involves connecting to the remote servers on a
socket, sending a few hundred bytes back and forth, and then opening
the next socket (leaving the current one open).
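
As a rough illustration of why tcp_smallest_anon_port has to come down
for this kind of test: every outbound connection to the same remote
address and port needs its own local port from the anonymous port
range. The sketch below assumes the top of that range is 65535 and the
stock lower bound is 32768; the helper names are made up for this
example and are not part of any tool mentioned in the post.

```python
# Assumed bounds of the anonymous (ephemeral) port range.
LARGEST_ANON_PORT = 65535
DEFAULT_SMALLEST_ANON_PORT = 32768


def max_outbound_connections(smallest_anon_port,
                             largest_anon_port=LARGEST_ANON_PORT):
    """Local ports available for connections to one remote address:port."""
    return largest_anon_port - smallest_anon_port + 1


def smallest_port_for(target_connections,
                      largest_anon_port=LARGEST_ANON_PORT):
    """Highest lower bound that still leaves target_connections ports."""
    return largest_anon_port - target_connections + 1


print(max_outbound_connections(DEFAULT_SMALLEST_ANON_PORT))  # 32768
print(smallest_port_for(60000))                              # 5536
```

With the assumed default, a single destination tops out at 32768
connections, so reaching roughly 60k means moving the lower bound down
to about 5536 or below. The limit applies per destination; spreading
load over several remote servers relaxes it.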

  In the past, I had been able to run many tests with 50k or so
simultaneous connections without any problems. We've had some code
updates (including our TCP application-level protocol now including
NULL characters), and I had not run a big load test in a while. When
I tried to run the same load tests that had previously run without a
hitch, all the LG boxes hung.

  Further investigation showed that the machines were running out of
memory once 10-14k connections were established. The application used
for the testing (a customized version of the open source program
"pasvlogin") was only using approx 100M of memory (via ps, prstat,
top). Besides that application, only the basic OS programs are on the
machine. I started watching the system memory use via mdb's ::memstat
command and saw that the vast majority of the memory allocation was
going to kernel space.

Here is a catastrophic sample taken from a panic dump I forced
by dropping the machine to the ok prompt and running 'sync' after it
had hung:
>mdb -k *3
Loading modules: [ unix krtld genunix ip ipc ufs_log usba nfs ptm ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     246987              1929  100%
Anon                           62                 0    0%
Exec and libs                   0                 0    0%
Page cache                     10                 0    0%
Free (cachelist)              331                 2    0%
Free (freelist)               156                 1    0%

Total                      247546              1933

During the test, I could clearly see the kernel memory usage rising
sharply as the number of TCP connections increased. There was no
other activity on the machine besides the load test and a few
monitoring commands such as vmstat and netstat.

A sample of a ::memstat before launching the test:
mdb -k
Loading modules: [ unix krtld genunix ip ipc ufs_log usba nfs random ptm ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      16458               128    7%
Anon                         1732                13    1%
Exec and libs                 756                 5    0%
Page cache                    110                 0    0%
Free (cachelist)           227632              1778   92%
Free (freelist)               858                 6    0%

Total                      247546              1933
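
To put the two samples side by side, here is a back-of-the-envelope
calculation of how much kernel memory each connection would account
for. It is only a rough sketch: the 8 KB page size is an assumption
(it is consistent with 247546 pages showing as 1933 MB), and the
connection counts are the approximate 10-14k range at which the
machines ran out of memory, not exact figures for the hung sample.

```python
PAGE_SIZE = 8192  # bytes; assumed, consistent with 247546 pages ~= 1933 MB

# Kernel page counts copied from the two ::memstat samples.
kernel_pages_before = 16458
kernel_pages_hung = 246987


def pages_to_mb(pages, page_size=PAGE_SIZE):
    """Convert a page count to megabytes (2**20 bytes)."""
    return pages * page_size / (1024 * 1024)


growth_pages = kernel_pages_hung - kernel_pages_before
growth_bytes = growth_pages * PAGE_SIZE


def kb_per_connection(connections):
    """Kernel growth divided evenly across the given connection count."""
    return growth_bytes / connections / 1024


print(f"kernel growth: {growth_pages} pages, "
      f"about {pages_to_mb(growth_pages):.0f} MB")
for conns in (10_000, 14_000):
    print(f"{conns} connections: "
          f"about {kb_per_connection(conns):.0f} KB each")
```

Under those assumptions the kernel grew by roughly 1.8 GB, which works
out to somewhere around 130-185 KB of kernel memory per connection.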

My questions are:
   * Should an application that just opens sockets and does a few
reads/writes on them (no other IPC performed) be able to cause the
kernel to use so much memory?

   * Are there any other tactics/techniques I can use to track down
what is causing this?

I'm working on a very small version of a TCP client/server so I can
run these tests with the original data protocol used (and variations)
to see if somehow the app data going across the socket is triggering
the extreme kernel memory usage. Am I off-base in thinking that this
shouldn't be happening?
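
For illustration only, a stripped-down client/server of that kind might
look like the sketch below. It is not the pasvlogin-based tool and it
is in Python rather than C; the echo server, the function names, and
the payload (a few hundred bytes with embedded NUL characters) are all
assumptions made up for the example. The client follows the test
pattern described above: connect, exchange a small message, keep the
socket open, and move on to the next one.

```python
import socket
import threading


def start_echo_server(host="127.0.0.1"):
    """Start a throwaway echo server on an OS-chosen port.

    Returns (listening_socket, port). Hypothetical stand-in for the
    real remote servers.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))
    srv.listen(128)

    def handle(conn):
        # Echo everything back until the client closes its side.
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(data)

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                break  # accept failed, e.g. listening socket was closed
            threading.Thread(target=handle, args=(conn,),
                             daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before full reply")
        buf += chunk
    return buf


def open_connections(host, port, count, payload):
    """Open `count` connections, exchange `payload` on each, keep all open."""
    conns = []
    for _ in range(count):
        s = socket.create_connection((host, port))
        s.sendall(payload)
        if recv_exact(s, len(payload)) != payload:
            raise RuntimeError("echo mismatch")
        conns.append(s)  # deliberately left open, as in the load test
    return conns


if __name__ == "__main__":
    server, port = start_echo_server()
    payload = b"login\x00user\x00" * 25  # 275 bytes with embedded NULs
    conns = open_connections("127.0.0.1", port, 50, payload)
    print(f"{len(conns)} connections held open")
    for c in conns:
        c.close()
    server.close()
```

Running the same loop with and without the embedded NUL bytes in the
payload, while watching ::memstat, would be one way to check whether
the data on the wire has anything to do with the kernel memory growth.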

The machines were originally running Sol 8 KP 22. Once the problem was
noticed, KP 27 was applied, and since that didn't help, the machines
were re-jumped to Sol 9 KP 11, which still didn't help (but at least
now I have ::memstat :-) )

Any comments or suggestions or RTFM (but say which one) are most
welcome!
Thanks,
-William Hathaway
wdh@perfectorder.com


