Re: ten thousand small processes

From: Terry Lambert (tlambert2_at_mindspring.com)
Date: 06/25/03

  • Next message: D. J. Bernstein: "Re: ten thousand small processes"
    Date: Wed, 25 Jun 2003 02:26:02 -0700
    To: "D. J. Bernstein" <djb@cr.yp.to>
    
    

    "D. J. Bernstein" wrote:
    > As I said, I don't particularly care about the text segment. I'm not
    > talking about ten thousand separate programs.
    >
    > Why does the memory manager keep the stack separate from data?

    Stack needs to be executable for the current signal trampoline
    code. FreeBSD might be able to adopt the OpenBSD "non-executable
    stack" code to let it do this, too, but that would really depend
    on the interaction with the various threads implementations, all
    of which have signal code that's likely to be unhappy with it.

    > Suppose a
    > program has 1000 bytes of data+bss. You could organize VM as follows:
    >
    > 0x7fffac18 0x7fffb000 0x80000000
    > <---- stack data+bss text, say 5 pages heap ---->
    >
    > As long as the stack doesn't chew up more than 3096 bytes and the heap
    > isn't used, there's just one page per process.

    The pages have to be on page boundaries. The absolute minimum
    number of pages you could have is 2, assuming you take your
    approach to putting the load address up high.

    In general, however, realize that the total virtual address
    space consists of (KVA + UVA = 4G). The model you propose,
    of throwing the heap out at 0x80000000 implies that the KVA
    and UVA are seperate.

    While this is possible, in theory, in practice it's a bad idea,
    since the uiomove/copyin/copyout code depends on the ability
    to run in system space and map the user address at its expected
    location (in fact, it is mapped there by default on kernel entry,
    to avoid having to establish mappings for data copies that cross
    protection domains).

    Breaking this would have some serious performance consequences,
    in terms of needing to create wired mapping for user pages in
    kernel space when copying data in and out -- and having to do
    explicit address translations, which would significantly
    damage a number of aspects of performance.

    So instead, the stack and heap grow towards each other, and the
    stack does not live in a data page.

    In general, the use of malloc() that is causing your primary
    page-count problem is tunable, via the malloc symlink flags hack,
    to get the total count down.

    However, the counts is probably never going to go below 4 pages,
    if there is any heap memory in use at all, no matter what you do,
    and 5 or more, if you count page table entries, etc., against the
    process.

    FWIW, most of these pages are phantoms -- that is, they exist as
    page mappings with no backing pages, and backing pages are allocated
    on fault at reference time.

    > As for page tables: Instead of allocating space for a bunch of nearly
    > identical page tables, why not overlap page tables, with the changes
    > copied on a process switch?

    Each page table descriptor is capable of handling 4M of memory,
    either directly, as a 4M page mapping (not used for user processes,
    except for mmap'ed device memory with proper size and alignment, if
    it goes out of its way to establish the mapping). The typical mapping
    in a PTD pointing to a 4K page containg PTE's.

    For the most part, overlap doesn't buy you anything in this case,
    because most programs are linked shared, which means they almost
    instantaneously have a delta on pages between the two processes,
    even if they were forked identically, as a result of the page (set)
    that contains the glue for the dynamic object references.

    Even if you were talking a statically linked program, then you will
    need to worry about data pages that are copy-on-write written as soon
    as you start doing stack and heap accesses, so the page mappings are
    not shareable. The closest you will come to shareable page mappings
    is going to be rfork() (where they *are* shared, but are not COW, so
    it does you no good in your suggested case), or the case that the
    data space is very, very much larger than 4M, at which point it *may*
    make sense to make them COW.

    I'm pretty sure this would not work at all for the 386, which does
    not support taking write faults on pages that are not marked writeable,
    when the processor is in system mode -- this is a bug in the 386 page
    protection implementation, which has to be worked around for security
    reasons by hacking the page mapping and taking a page not present
    fault, and then fixing it up based on apriori knowledge of the page
    mapping; failure to do this extra work would mean that someone could
    read or write arbitrary kernel addresses with arbitrary data, thus
    escalating priviledges by overwriting credentials.

    Even throwing out the 386, I'm pretty sure that the fault on a page
    that contains page mappings is not defined for some circumstances,
    at least from my reading of the IA32 architecture manual. So to
    deal with this, you would probably need to move to explicit TLB
    shootdowns for most cases, and, again, you are screwed in the shared
    mappings department.

    For the most part, this won't work anyway, since most x86 UNIX
    systems don't use a separate GDT entry per process, but instead
    have one for the recursive mapping, one for the kernel, one for
    text, and one for data. Occasionally, they also support one per
    VM86() instance.

    Changing this would have the negative effect of limiting the
    total number of processes you could run simultaneously (see the
    Linux mailing list archives for why they switched away from using
    the TSS in order to do context switching: all TSS descriptors must
    reside in the GDT, which has only 8192 entries).

    It's not worth accepting this limit, in order to obtain page table
    sharing for a minority of tasks, since you seem to be interested
    in small tasks which could not benefit from doing this anyway.

    I really reccomend:

            Protected Mode Software Architecture
            Tim Shanley
            MindShare, Inc.
            Addison-Wesley Publishing Company
            ISBN: 0-201-55447-X

    Specifically:

            Chapter 8: Code Segments
            Chapter 9: Data and Stack Segments
            Chapter 11: Mechanics of a Task Switch
            Chapter 13: Virtual Paging

    > As for 39 pages of VM, mostly stack: Can the system actually allocate
    > 390000 pages of VM? I'm only mildly concerned with the memory-management
    > time; what bothers me is the loss of valuable address space. I hope that
    > this 128-kilobyte stack carelessness doesn't reflect a general policy of
    > dishonest VM allocation (``overcommitment''); I need to be able to
    > preallocate memory with proper error detection, so that I can guarantee
    > the success of subsequent operations.

    It is, in fact, memory overcommit. It would take some work to
    disable this. BSD systems have used memory overcommit ever since
    they adopted the Mach VM architecture.

    > As for malloc()'s careless use of memory: Is it really asking so much
    > that a single malloc(1) not be expanded by a factor of 16384?

    You need to ask malloc() for this behaviour explicitly; it's not
    implicit in the default system configuration.

    The primary reason for this is that most programmers who have
    been trained in the last two decades believe that "All the world's
    a VAX"; e.g.: "memory is free" and "Intel Byte order is right and
    network byte order is wrong", and other untruths.

    Given this reality, most programs these days are what a lot of us
    who were trained up when 32K was an *amazing* amount of memory
    would politely call "bloated pigs". It makes sense, in that case,
    to be aggressive in allocation of memory, in the expectation that
    programs will tend to be memory-unconscious enough that they will
    tend to use everything you preallocate on their behalf.

    [ ... default malloc behaviour complaints that are controllable
          via flags settings ... ]

    > (Quite a few of my programs simulate this effect by checking for space
    > in a bss array, typically 2K. But setting aside the right amount of
    > space would mean compiling, inspecting the brk alignment, and
    > recompiling. I also feel bad chewing up space on systems where malloc()
    > actually knows what it's doing.)

    I defy you to name one modern UNIX variant that doesn't overcommit
    memory. I doubt there is a malloc() existant on these systems that
    "actually knows what its doing".

    BTW: I would class your suggestion of COW page table sharing as a
    type of memory overcommit, as well.

    If you can find a counterexample, the logical thing to do is to
    port it and carry it around with your code, if you need that type
    of behaviour, since most libc malloc() implementations are defined
    to use weak symbols, to permit you to override their malloc/free
    implementation completely. So long as your implementation conforms
    to POSIX, everything else (strsave(3), et. al.) should "just work"
    with the replacement.

    > As for the safety of writing code that makes malloc() fail horribly:
    > After the Solaris treatment of BSD sockets, and the ``look, Ma, I can
    > make an only-slightly-broken imitation of poll() using select()!''
    > epidemic, I don't trust OS distributors to reserve syscall names for
    > actual syscalls. I encounter more than enough portability problems
    > without going out of my way to look for them.

    Malloc isn't, and never has been, a system call on POSIX/UNIX
    systems. It's *always* been in section 3, for long as it's been
    an accepted part of the OS. Only things whose manual pages are
    in section 2 are system calls, and that's not a guarantee that
    they will stay there: basically, it's legal to implement all of
    POSIX on top of asynchronous system calls with a wait primitive,
    and provide all of the "expected" POSIX semantics in nothing more
    than a user space library.

    -- Terry
    _______________________________________________
    freebsd-performance@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"


  • Next message: D. J. Bernstein: "Re: ten thousand small processes"

    Relevant Pages

    • Re: style question,itoa
      ... able to allocate buffers in some convenient location the callee ... dynamically allocate a block of memory and computing this size to use ... call might require allocating a page for the stack. ... " Checking every single malloc in a bigger application for possible ...
      (comp.unix.programmer)
    • Re: ten thousand small processes
      ... them roughly in the middle you take space away from heap and stack ... identical across two processes as is the case with large shared memory ... There is no special allocation for virtual address space that is ... the standard libc or at least not the standard malloc implementation. ...
      (freebsd-performance)
    • Re: No need to optimize in assembly anymore
      ... > malloc or new functions if memory is not very big. ... Allocating more than 4KB on stack ... Let's imagine that you allocate 10MB of memory (it is 10 times more than the ...
      (comp.lang.asm.x86)
    • Re: No need to optimize in assembly anymore
      ... > malloc or new functions if memory is not very big. ... Allocating more than 4KB on stack ... Let's imagine that you allocate 10MB of memory (it is 10 times more than the ...
      (alt.lang.asm)
    • Re: ten thousand small processes
      ... "stack and data have different VM protections and, ... >> writing a general purpose malloc is significantly more complicated, ... Calling mallocfor the first time causes a 16K spike in memory usage under ... are wonderful decisions for a memory allocator tuned for a very ...
      (freebsd-performance)