Re: IA64 Linux VM performance woes.

From: Alexis Cousein (al_at_brussels.sgi.com)
Date: 04/21/04


Date: Wed, 21 Apr 2004 14:35:17 +0200

Michael E. Thomadakis wrote:

> Hello all.
>
> We are trying to deploy a 128 PE SGI Altix 3700 running Linux, with 265GB main
> memory and 10TB RAID disk (TP9500) :
>
> # cat /etc/redhat-release
> Red Hat Linux Advanced Server release 2.1AS (Derry)
>
> # cat /etc/sgi-release
> SGI ProPack 2.4 for Linux, Build 240rp04032500_10054-0403250031
>
> # uname -a
> Linux c 2.4.21-sgi240rp04032500_10054 #1 SMP Thu Mar 25 00:45:27
> PST 2004 ia64 unknown
>
> We have been experiencing bad performance and downright bad behavior when we
> are trying to read or write large files (10-100GB).
>
> File Throughput Issues
> ----------------------
> At first the throughput we are getting without file cache bypass is at around
> 440MB/sec MAX. This specific file system has LUNs whose primary FC paths go
> over all four 2Gb/sec FC channels and the max throughput should have been
> close to 800MB/sec.
>
> I've also noticed that the FC adapter driver threads are running at 100% CPU
> utilization, when they are pumping data to the RAID for long time. Is there
> any data copy taking place at the drivers? The HBAs are from QLogic.
>
You do have to live with the Linux 2.4 block layer for some time -- and if
you're used to IRIX (or have ever used the 2.6 layer), that will hurt.

Some other people commented about the higher layers being a piece of crock,
but you can safely ignore those: the QLogic driver and XSCSI layer are
pretty solid (in the case of XSCSI, the architecture is close to that of
the corresponding IRIX layer). That part of the work is something SGI
should have pretty much nailed down for you.

It's worse for RAIDs than for JBODs, as they depend on lots of I/O
operations in flight for a single LUN (and as a result, really hate
long I/O operations being cut into smaller pieces).

Of course, make sure that you've used xscsiqueue to set CTQ parameters
correctly, or not even the 2.6 block layer + XSCSI + QLogic driver
combination will save you...
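
For reference, the 800MB/sec ceiling quoted above is just channel arithmetic.
A minimal sketch, assuming the usual nominal payload of roughly 200MB/sec per
2Gb/sec FC channel (that per-channel figure and the function name are
assumptions of this sketch, not something measured on this machine):

```python
def fc_aggregate_mb_per_s(channels, mb_per_s_per_channel=200):
    """Nominal aggregate payload bandwidth over several FC channels."""
    # Assumption: each 2Gb/sec channel carries about 200MB/sec of payload.
    return channels * mb_per_s_per_channel


peak = fc_aggregate_mb_per_s(4)   # four channels, as in the original post
observed = 440                    # MB/sec reported without cache bypass
fraction = observed / peak        # share of the nominal peak actually reached
print(f"{observed} of {peak} MB/sec = {fraction:.0%} of nominal peak")
```

So the reported figure is a bit over half of what the four channels could
nominally carry.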

>
> VM Untoward Behavior
> -------------------
> A more disturbing issue is that the system does NOT clean up the file cache
> and eventually all memory gets occupied by FS pages. Then the system simply
> hangs.

There are quite a few engineering incidents logged on several of these aspects
-- I would indeed suggest that this forum may not be the appropriate place
to ask about those.

There are already quite a number of 2.4 patches that at least address some
issues; make sure you're current on them. You really, really, really, want
*10065* or successors on that machine at the least (note: assuming you are
not using CXFS). You seem to be running 10054.
>
> We tried enabling / removing bootCPUsets, bcfree and anything else available
> to us. The crashes just keep coming. Recently we started experiencing a
> lot of 'Cannot do kernel page out at address' by the bdflush and kupdated
> threads as well.

That suggests you may be running with not much swap (or perhaps missing 10065).

Yes, IRIX wouldn't need any swap, but in high-I/O situations where the buffer
cache fills up the memory, it *is* possible for Linux to think it has to swap
pages when it should be reclaiming free buffer cache pages - I'd suggest
making sure that you have about 1/4th of the memory configured as swap, to
make the kernel resist the temptation of waking up the out of memory killer
when you don't want it to, and to let it sort things out by moving things to
swap (albeit unnecessarily) when it's painted itself into a corner,
rather than letting it stomp on your feet.
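
A quick way to check the swap rule of thumb above is to compare the MemTotal
and SwapTotal lines from /proc/meminfo. This is only a sketch: the function
name is made up for illustration, the 1/4 fraction is the suggestion from this
message rather than a hard requirement, and it only looks at the "Name: N kB"
lines, ignoring anything else in the file.

```python
import re


def swap_shortfall_kb(meminfo_text, fraction=0.25):
    """Return how many kB of swap are missing to reach fraction * MemTotal."""
    fields = {}
    for line in meminfo_text.splitlines():
        # Only accept lines of the form "Name:   12345 kB".
        match = re.match(r"^(\w+):\s+(\d+)\s*kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    wanted = int(fields["MemTotal"] * fraction)
    return max(0, wanted - fields["SwapTotal"])


# Made-up numbers for illustration, not taken from the machine in question.
sample = "MemTotal:  1000000 kB\nSwapTotal:  100000 kB\n"
print(swap_shortfall_kb(sample), "kB of swap short of the 1/4 rule of thumb")
```

On a live system you would pass in open("/proc/meminfo").read() instead of the
sample string.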

> Tuning bdflush/kupdated Behavior
> --------------------------------
>
> One of our main objectives at our center is to maximize file throughput for our
> systems. We are a medium size Supercomputing Center where compute and I/O
> intensive numerical computation code runs in batch sub-systems. Several
> programs expect and generate often very large files, in the order of 10-70GBs.
> Minimizing file access time is important in a batch environment since
> processors remain allocated and idle while data is shuttled back and forth
> from the file system.

There are some things you can do in a batch environment if users are cooperative:
if they make large files in a scratch or tmp area you can identify, instead of
running bcfree (or in addition to it), you can use a program that calls posix_fadvise()
to tell the system you've finished with all the files in that directory (call
using a find command from an epilogue script).
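
A minimal sketch of such a helper, written in Python for brevity rather than
as the small C program you would probably deploy. The function name is made up
for illustration, and the directory walk stands in for the find command. It
assumes a kernel and Python build that expose posix_fadvise(), and it calls
fsync() first because pages that are still dirty cannot be dropped by
POSIX_FADV_DONTNEED.

```python
import os


def drop_cached_files(directory):
    """Advise the kernel that cached pages of files under directory are no longer needed."""
    dropped = 0
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # unreadable or already removed; nothing to do
            try:
                os.fsync(fd)  # write back dirty pages so they can be dropped
                # offset 0 and length 0 mean "the whole file"
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
                dropped += 1
            finally:
                os.close(fd)
    return dropped
```

An epilogue script could then call drop_cached_files() on the job's scratch
directory once the job has finished. The call is only advice: the kernel is
free to keep pages around, so treat this as a hint and not a guarantee.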

Of course, trying to make most applications use FFIO (in C) and thus private
user-space caches instead of the buffer cache is also a worthwhile endeavour,
especially for the I/O hogs.
>
> Another common problem is the competition between file cache and computation
> pages. We definitely do NOT want file cache pages being cached, while
> computation pages are reclaimed.

With the 2.4 kernel, the kernel doesn't go off-node if it can reclaim clean
buffer cache pages on-node. Of course, that means you'd better be set up
to flush pages to disk fast enough to avoid on-node memory being filled with
dirty pages while you're still allocating memory.

> Ideally we need to:

Talk to support (and your local system engineer). There are many, many people
in engineering helping people (probably including you) on this, and the channels
should be working; at least they are from where I'm sitting...isn't Trey Prinz
already working with you on these issues?

With your setup, you'd be pushing the envelope even on an IRIX machine - and
it's only years of hard work that have made IRIX "do the right thing" in many
circumstances (but not all ;) ) automagically.

-- 
Alexis Cousein                   Senior Systems Engineer
alexis@sgi.com                   SGI/Silicon Graphics Brussels
<opinions expressed here are my own, not those of my employer>
If I have seen further, it is by standing on reference manuals.

