Re: ten thousand small processes

From: Chuck Swiger (cswiger_at_mac.com)
Date: 06/26/03

  • Next message: D. J. Bernstein: "Re: ten thousand small processes"
    Date: Wed, 25 Jun 2003 18:57:51 -0400
    To: "D. J. Bernstein" <djb@cr.yp.to>
    
    

    D. J. Bernstein wrote:
    [ ... ]
    > Why does the memory manager keep the stack separate from data? Suppose a
    > program has 1000 bytes of data+bss. You could organize VM as follows:
    >
    > 0x7fffac18 0x7fffb000 0x80000000
    > <---- stack data+bss text, say 5 pages heap ---->
    >
    > As long as the stack doesn't chew up more than 3096 bytes and the heap
    > isn't used, there's just one page per process.

    Remember that VMM hardware enforces protections per page, so segments with
    different permissions must be page-aligned: TEXT should be on pages marked X
    (or RX if the local architecture needs it), DATA+BSS should be RW, and I think
    FreeBSD needs the stack to be RWX. We also need to consider the kernel's
    address space: 32-bit systems generally reserve the top 2GB, or sometimes
    less, exclusively for the kernel.
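
As a rough illustration of the page-alignment point, here is a minimal C sketch that counts pages when each segment must sit on its own pages. The helper names `pages_for` and `min_pages` are made up for this example, and the premise that text, data+bss, and stack never share a page is an assumption taken from the permissions listed above, not a statement about what any particular kernel does.

```c
#include <stddef.h>

/* Hypothetical helper: round a byte count up to whole pages. */
static size_t pages_for(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/*
 * Hypothetical helper: minimum page count when text, data+bss, and
 * stack each need their own protection and so cannot share a page
 * (an assumption made for illustration).
 */
static size_t min_pages(size_t text, size_t data_bss, size_t stack,
                        size_t page_size)
{
    return pages_for(text, page_size)
         + pages_for(data_bss, page_size)
         + pages_for(stack, page_size);
}
```

With 4096-byte pages, 1000 bytes of data+bss and 3096 bytes of stack would take two pages under this assumption, rather than the single shared page in the quoted layout.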

    Besides, most programs are probably not built as PIC, so their starting
    address could not be relocated arbitrarily. Still, it might be interesting
    to consider the following process address map:

    VM Address        Usage
    0x0               PAGEZERO
    0x4000            XXX bytes reserved per the hard limit to process stack size
    XXX (+ 0x4000)    TEXT segment
    YYY               DATA + BSS
    ZZZ               heap
    0x80000000        KVA

    > As for page tables: Instead of allocating space for a bunch of nearly
    > identical page tables, why not overlap page tables, with the changes
    > copied on a process switch?

    Mach uses copy-on-write for the VMOs associated with the virtual address space
    used by processes; these are similar abstractions to the page table entries
    used under classic BSD. From "man vmmap":

          The share mode describes whether pages are shared between processes, and
          what happens when pages are modified. Private pages (PRV) are pages only
          visible to this process. They are allocated as they are written to, and
          can be paged out to disk. Copy-on-write (COW) pages are shared by multi-
          ple processes (or shared by a single process in multiple locations).
          When the page is modified, the writing process then receives its own copy
          of the page. Empty (NUL) sharing implies that the page does not really
          exist in physical memory. Aliased (ALI) and shared (SHM) memory is
          shared between processes.

          The share mode typically describes the general mode controlling the
          region. For example, as copy-on-write pages are modified, they become
          private to the application. Even with the private pages, the region is
          still COW until all pages become private. Once all pages are private,
          then the share mode would change to private.

          The far left column names the purpose of the memory: text segment, data
          segment, allocated via malloc, stack, etc. For regions loaded from bina-
          ries, the far right shows the library loaded into the memory.

          Some lines in vmmap's output describe submaps. A submap is a shared set
          of virtual memory page descriptions that the operating system can reuse
          between multiple processes. The memory between 0x70000000 and
          0x80000000, for example, is a submap containing the most common dynamic
          libraries. Submaps minimize the operating system's memory usage by rep-
          resenting the virtual memory regions only once. Submaps can either be
          shared by all processes (machine-wide) or local to the process (process-
          only). If the contents of a machine-wide submap are changed -- for exam-
          ple, the debugger makes a section of memory for a dylib writable so it
          can insert debugging traps -- then the submap becomes local, and the ker-
          nel will allocate memory to store the extra copy.
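
The copy-on-write behaviour described in the man page excerpt can be observed from user space with `fork(2)`. The sketch below is only an illustration: the function name `child_write_is_private` is hypothetical, and it demonstrates the visible semantics (a child's write does not reach the parent), not how a given kernel implements the sharing internally.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write made by a forked child
 * stays invisible to the parent, 0 if the parent sees the change,
 * and -1 on error.
 */
static int child_write_is_private(void)
{
    volatile int value = 1;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write lands in the child's own copy of the page. */
        value = 2;
        _exit(value == 2 ? 0 : 1);
    }

    /* Parent: wait for the child, then check our copy is unchanged. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value == 1 ? 1 : 0;
}
```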

    8-cube# vmmap 252
    ==== Non-writable regions for process 252
    __PAGEZERO 0 [ 4K] ---/--- SM=NUL syslogd
    __TEXT 1000 [ 20K] r-x/rwx SM=COW syslogd
    __LINKEDIT 7000 [ 4K] r--/rwx SM=COW syslogd
    Submap 90000000-9fffffff r--/r-- machine-wide submap
    __TEXT 90000000 [ 932K] r-x/r-x SM=COW ...System.B.dylib
    __LINKEDIT 900e9000 [ 260K] r--/r-- SM=COW ...System.B.dylib
    __TEXT 93a40000 [ 20K] r-x/r-x SM=COW ...Common.A.dylib
    __LINKEDIT 93a45000 [ 4K] r--/r-- SM=COW ...Common.A.dylib
    Submap a000b000-a3a3ffff r--/r-- process-only submap
    Submap a3a41000-afffffff r--/r-- process-only submap
                                    aff80000 [ 512K] r--/r-- SM=SHM

    ==== Writable regions for process 252
    __DATA 6000 [ 4K] rw-/rwx SM=PRV syslogd
    MALLOC_USED(DefaultMallocZone_ 8000 [ 20K] rw-/rwx SM=COW
    MALLOC_USED(DefaultMallocZone_ d000 [ 4K] rw-/rwx SM=ZER
    MALLOC_USED(DefaultMallocZone_ e000 [ 4K] rw-/rwx SM=COW
    MALLOC_FREE(DefaultMallocZone_ f000 [ 228K] rw-/rwx SM=ZER
    __TEXT 8fe00000 [ 288K] rw-/rwx SM=COW /usr/lib/dyld
    __DATA 8fe48000 [ 8K] rw-/rwx SM=COW /usr/lib/dyld
    __DATA 8fe4a000 [ 4K] rw-/rwx SM=COW /usr/lib/dyld
    __DATA 8fe4b000 [ 4K] rw-/rwx SM=ZER /usr/lib/dyld
    __DATA 8fe4c000 [ 12K] rw-/rwx SM=COW /usr/lib/dyld
    __DATA 8fe4f000 [ 144K] rw-/rwx SM=ZER /usr/lib/dyld
    __LOCK 8fe73000 [ 4K] rw-/rwx SM=NUL /usr/lib/dyld
    __LINKEDIT 8fe74000 [ 44K] rw-/rwx SM=COW /usr/lib/dyld
    Submap 90000000-9fffffff r--/r-- machine-wide submap
    __DATA a0000000 [ 4K] rw-/rw- SM=ZER ...System.B.dylib
    __DATA a0001000 [ 4K] rw-/rw- SM=COW ...System.B.dylib
    __DATA a0002000 [ 20K] rw-/rw- SM=COW ...System.B.dylib
    __DATA a0007000 [ 16K] rw-/rw- SM=PRV ...System.B.dylib
    Submap a000b000-a3a3ffff r--/r-- process-only submap
    __DATA a3a40000 [ 4K] rw-/rw- SM=COW ...Common.A.dylib
    Submap a3a41000-afffffff r--/r-- process-only submap
    STACK[0] bff80000 [ 508K] rw-/rwx SM=PRV
                                    bffff000 [ 4K] rw-/rwx SM=PRV

    ==== Legend
    SM=sharing mode:
             COW=copy_on_write PRV=private NUL=empty ALI=aliased
             SHM=shared ZER=zero_filled S/A=shared_alias

    ==== Summary for process 252
    ReadOnly portion of Libraries: Total=1572KB resident=1444KB(92%) swapped_out_or_unallocated=128KB(8%)
    Writable regions: Total=968KB written=40KB(4%) resident=88KB(9%) swapped_out=0KB(0%) unallocated=880KB(91%)

    > As for 39 pages of VM, mostly stack: Can the system actually allocate
    > 390000 pages of VM?

    I believe 390000 4K pages is 1523 MB: if you've got the datasize resource limit
    set high enough and you've got the RAM or swap space available, the answer to
    your question should be yes.
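
The arithmetic above, and the limit check it depends on, can be sketched in C. The helper names `pages_to_mib` and `fits_data_limit` are made up for illustration; `getrlimit(2)` and `RLIMIT_DATA` are standard. This only compares a size against the soft datasize limit; it says nothing about whether RAM or swap is actually available.

```c
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical helper: whole MiB occupied by `pages` pages of `page_size` bytes. */
static uint64_t pages_to_mib(uint64_t pages, uint64_t page_size)
{
    return (pages * page_size) / (1024u * 1024u);
}

/*
 * Hypothetical helper: returns 1 if `bytes` fits within the soft
 * datasize limit, 0 if it does not, and -1 if the limit cannot be read.
 */
static int fits_data_limit(uint64_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_DATA, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return bytes <= (uint64_t)rl.rlim_cur ? 1 : 0;
}
```

For example, `pages_to_mib(390000, 4096)` gives 1523, matching the figure above.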

    > I'm only mildly concerned with the memory-management
    > time; what bothers me is the loss of valuable address space. I hope that
    > this 128-kilobyte stack carelessness doesn't reflect a general policy of
    > dishonest VM allocation (``overcommitment''); I need to be able to
    > preallocate memory with proper error detection, so that I can guarantee
    > the success of subsequent operations.

    Preallocate at compile time, or preallocate at process run time?
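
To make the two options concrete, here is a minimal C sketch of each. The names `static_pool` and `prealloc_runtime` and the 64 KB size are invented for this example. Note the limits of the run-time variant: it reports a refused request through a NULL return, but on a system that overcommits memory, touching the pages is not a guaranteed way to detect shortage as an error.

```c
#include <stdlib.h>
#include <string.h>

/* Compile-time preallocation: the buffer lives in BSS and its size is
 * fixed when the program is built. (Hypothetical name and size.) */
static char static_pool[64 * 1024];

/*
 * Run-time preallocation (hypothetical helper): returns NULL if the
 * allocator refuses the request. The memset writes every byte so the
 * pages are touched before later operations depend on them.
 */
static void *prealloc_runtime(size_t bytes)
{
    void *p = malloc(bytes);

    if (p == NULL)
        return NULL;
    memset(p, 0, bytes);
    return p;
}
```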

    > As for malloc()'s careless use of memory: Is it really asking so much
    > that a single malloc(1) not be expanded by a factor of 16384?
    >
    > Here's a really easy way to improve malloc(). Apparently, right now,
    > there's no use of the space between the initial brk and the next page
    > boundary. Okay: allocate that space in the simplest possible way---

    It's easy to write a memory allocator that performs a specific case well;
    writing a general purpose malloc is significantly more complicated, and
    FreeBSD's malloc is tuned for programs which are much larger than your example.

    If you know of a malloc() implementation that does better than FreeBSD's, and is
    suitable for SMP systems, let us know.

    -Chuck

    _______________________________________________
    freebsd-performance@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"

