Re: read vs. mmap (or io vs. page faults)

From: Matthew Dillon (
Date: 06/22/04

    Date: Mon, 21 Jun 2004 17:15:09 -0700 (PDT)
    To: Mikhail Teterin <>

    :The mmap interface is supposed to be more efficient -- theoretically --
    :because it requires one less buffer-copying, and because it (together
    :with the possible madvise()) provides the kernel with more information
    :thus enabling it to make better (at least -- no worse) decisions.

        Well, I think you forgot my earlier explanation regarding buffer copying.
        Buffer copying is a very cheap operation if it occurs within the L1 or
        L2 cache, and that is precisely what is happening when you read() into
        a fixed buffer in a loop in a C program... your buffer is fixed in
        memory and is almost guaranteed to be in the L1/L2 cache, which means
        that the extra copy operation is very fast on a modern processor. It's
        something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
        example, and 3 GBytes/sec uncached to main memory.

        Consider the cpu time cost, then, of the local copy on a 2GB file...
        at roughly 12 GBytes/sec, the cpu time cost on an AMD64 is about 2/12
        of one second. This is the number mmap would have to beat.

        As you can see by your timing results, even on your fastest box,
        processing a file around that size is only going to incur 1-2 seconds
        of real time overhead to do the extra buffer copy. 2 seconds is a hard
        number to beat.

        This is something you can calculate yourself. Time a dd from /dev/zero
        to /dev/null.

            crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
            268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)

            amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
            268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)

            amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
            536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)

        Try it for different buffer sizes (16K through 16MB) and you will get
        a feel for how the L1 and L2 caches affect copying bandwidth. These
        numbers are reasonably close to the raw memory bandwidth available to
        the cpu (and will be different depending on whether the buffer fits in
        the L1 or L2 caches, or doesn't fit at all).
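
        As a rough in-process analogue of the dd runs above, the following C
        sketch times memcpy() between two buffers of a chosen size. The name
        copy_bandwidth and its parameters are made up for illustration, and
        memcpy() between user buffers is not the same code path as dd between
        /dev/zero and /dev/null, so the absolute numbers should not be
        expected to match the dd figures.

```c
#define _DEFAULT_SOURCE   /* glibc: expose clock_gettime() under strict -std modes */
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy about `total` bytes in `bufsize`-byte chunks between two buffers
 * and return the observed rate in bytes per second (0.0 on bad input or
 * allocation failure). Hypothetical helper, for illustration only. */
double copy_bandwidth(size_t bufsize, size_t total)
{
    if (bufsize == 0 || total < bufsize)
        return 0.0;

    char *src = malloc(bufsize);
    char *dst = malloc(bufsize);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 1, bufsize);   /* touch both buffers so their pages are */
    memset(dst, 0, bufsize);   /* faulted in before the clock starts    */

    volatile unsigned char sink = 0;   /* keeps the copies observable */
    size_t copied = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (copied + bufsize <= total) {
        src[0] = (char)(copied / bufsize);      /* vary the input a little */
        memcpy(dst, src, bufsize);
        sink ^= (unsigned char)dst[bufsize - 1];
        copied += bufsize;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    free(src);
    free(dst);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9;           /* guard against a very coarse clock */
    return (double)copied / secs;
}
```

        Calling it with bufsize stepping from 16K up to 16MB and a fixed
        total (say 256MB) should show the rate changing as the buffers stop
        fitting in the L1 and then the L2 cache.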

        The mmap interface is not supposed to be more efficient, per se. Why
        would it be? There are overheads involved with mapping the page table
        entries and taking faults to map more. Even if you pre-mapped everything,
        there are still overheads involved in populating the page table and
        performing invlpg operations on the TLB to reload the entry, and for
        large data sets there is overhead involved with removing page table
        entries and invalidating the pte. On a modern cpu, where an L1 cache
        copy is a two cycle streaming operation, the several hundred (or more)
        cycles it takes to process a page fault or even just populate the
        page table is equivalent to a lot of copied bytes.

        This immediately puts mmap() at a disadvantage on a modern cpu, but of
        course it also depends on what the data processing loop itself is
        doing. If the data processing loop is sensitive to the L1 cache then
        processing larger chunks of data is going to make it more efficient,
        and mmap() can certainly provide that where read() might require buffers
        too large to fit comfortably in the L1/L2 cache. On the other hand, if
        the processing loop is relatively insensitive to the L1 cache (i.e. it's
        small), then you can afford to process the data in smaller chunks, like
        16K, without any significant penalty.

        mmap() is not designed to streamline large demand-page reads of data
        sets much larger than main memory. mmap() works best for data that
        is already cached in the kernel, and even then it still has a fairly
        large hurdle to overcome vs a streaming read(). This is a HARDWARE
        limitation. Drastic action would have to be taken in software to get
        rid of this overhead (we'd have to use 4MB page table entries, which
        come with their own problems).

        The overhead required to manage a large mmap'd data set can skyrocket.
        FreeBSD (and DragonFly) have heuristics that attempt to detect
        sequential operations like this with mmap'd data and to depress the
        page priority behind the read (so: read-ahead and depress-behind), and
        this works, but it only mitigates the additional overhead somewhat; it
        doesn't get rid of it.
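
        For reference, a sequential pass over a mapped file might look like
        the sketch below. The function name sum_file_mmap and the byte-sum
        "processing" are placeholders, and the madvise(MADV_SEQUENTIAL) call
        is only a hint; how much it helps depends on the kernel. Every page
        touched in the loop still has to be faulted in or have its page table
        entry populated, which is the overhead discussed above.

```c
#define _DEFAULT_SOURCE   /* glibc: expose madvise() under strict -std modes */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum every byte of `path` through a read-only mapping.
 * Returns the sum, or -1 on error. Hypothetical helper. */
int64_t sum_file_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    /* Advisory only: tell the VM system the access pattern is linear. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    int64_t sum = 0;
    for (size_t i = 0; i < len; i++)   /* first touch of each page faults */
        sum += p[i];

    munmap(p, len);
    return sum;
}
```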

        For linear processing of large data sets you almost universally want
        to use a read() loop. There's no good reason to use mmap().
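
        A minimal sketch of such a read() loop, with a single fixed buffer
        that is reused on every pass so it tends to stay in the L1/L2 cache.
        The name sum_file_read and the byte-sum "processing" are placeholders
        for whatever the real loop does; a production version would probably
        also retry on EINTR.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Sum every byte of `path`, reading it in `bufsize`-byte chunks into
 * one reused buffer. Returns the sum, or -1 on error. Hypothetical helper. */
int64_t sum_file_read(const char *path, size_t bufsize)
{
    if (bufsize == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *buf = malloc(bufsize);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    int64_t sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0) {
        /* The kernel has copied n bytes into buf; process them in place. */
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    }

    free(buf);
    close(fd);
    return n < 0 ? -1 : sum;        /* n == 0 means end of file */
}
```

        A bufsize of 16K to 64K is in the range discussed above; the result
        does not depend on the chunk size, only the speed does.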

    :=: read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w
    :Well, now we are venturing into the domain of humans' subjective
    :perception... I'd say, 12% is plenty, actually. This is what some people
    :achieve by rewriting stuff in assembler -- and are proud, when it works

        Nobody is going to stare at their screen for one minute and 17 seconds
        and really care that something might take one minute and 27 seconds instead
        of one minute and 17 seconds. That's subjective truth.

        The type of test you want to do is this:

        [start timing]
        [read all data into memory]
        [stop timing] -> print timing results
        [start timing]
        [process all data]
        [stop timing] -> print timing results

        Now you have something practical you can look at... you can look at the
        I/O bandwidth required to bring the data into memory without the
        complications of whatever processing you are doing on the data being
        mixed in. *THEN* you can say something more definitive about the
        kernel overhead required to get the data into memory first, because
        you can definitely say what 'bandwidth', or data rate, has been
        achieved in getting the data from the disk or kernel caches into
        your program's memory space (faulted in and everything, ready to access).
        You could then compare that to the times required to do it in a mixed
        environment (read-processing loop). If *THOSE* numbers are hugely
        different then you can say something definitive about the relative
        efficiency of the mixed mode processing versus just doing pure I/O,
        for both read() and mmap() independently.
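
        The two-phase test outlined above could be sketched in C roughly as
        follows, here for the read() case. The function name
        time_load_then_process, the byte-sum processing step, and the output
        format are all assumptions made for illustration. It reads the whole
        file into one malloc'd buffer, so it only suits files that fit in
        memory.

```c
#define _DEFAULT_SOURCE   /* glibc: expose clock_gettime() under strict -std modes */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed since *t0 on the monotonic clock. */
static double seconds_since(const struct timespec *t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0->tv_sec)
         + (double)(t1.tv_nsec - t0->tv_nsec) / 1e9;
}

/* Phase 1: read all of `path` into memory. Phase 2: process it (byte sum).
 * Each phase is timed and reported separately. Returns 0 on success,
 * -1 on error. Hypothetical helper. */
int time_load_then_process(const char *path, double *load_sec,
                           double *proc_sec, int64_t *sum_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    size_t size = (size_t)st.st_size;
    unsigned char *data = malloc(size ? size : 1);
    if (data == NULL) {
        close(fd);
        return -1;
    }

    struct timespec t0;
    clock_gettime(CLOCK_MONOTONIC, &t0);            /* [start timing] */
    size_t got = 0;
    ssize_t n = 0;
    while (got < size && (n = read(fd, data + got, size - got)) > 0)
        got += (size_t)n;                           /* [read all data] */
    *load_sec = seconds_since(&t0);                 /* [stop timing] */
    close(fd);
    if (n < 0) {
        free(data);
        return -1;
    }
    printf("load:    %zu bytes in %.6f s\n", got, *load_sec);

    clock_gettime(CLOCK_MONOTONIC, &t0);            /* [start timing] */
    int64_t sum = 0;
    for (size_t i = 0; i < got; i++)                /* [process all data] */
        sum += data[i];
    *proc_sec = seconds_since(&t0);                 /* [stop timing] */
    printf("process: %zu bytes in %.6f s\n", got, *proc_sec);

    *sum_out = sum;
    free(data);
    return 0;
}
```

        Dividing the byte count by the load time gives the data rate for
        getting the file into the program's address space, which can then be
        compared against the wall-clock time of the mixed read-processing
        loop.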


    :Put it into perspective -- 10-15% is usually the difference between
    :the latest processor and the previous one. People are willing to pay
    :hundreds of dollars premium...

        15% is nothing anyone cares about except perhaps gamers. I certainly
        couldn't care less about 15%. 50%, on the other hand, is something
        that I would care about. But upgrading isn't just a function of raw
        cpu speed, it's also a function of general improvements in hardware
        and hardware interfaces... usb, usb2, firewire, sata, and so forth.


    :Besides, the differences can be higher. Here is from md5-ing a
    :2097272832-bytes file over NFS (on a Gigabit network, no jumbo frames).
    :The machine runs a FreeBSD-current on a single P4 2GHz:
    : mmap1: 17.115u 16.106s 2:20.84 23.5% 5+166k 0+0io 253421pf+0w
    : read1: 19.468u 12.179s 1:27.80 36.0% 4+163k 0+0io 0pf+0w
    : mmap2: 17.214u 13.265s 2:13.75 22.7% 5+165k 1+0io 204842pf+0w
    : read2: 19.142u 11.576s 1:20.22 38.2% 4+162k 0+0io 4pf+0w
    :mmap is 87% slower (or read is 38% faster)! According to `systat -if',
    :mmap was reading at about 13Mb/s, while read was consistently above
    :If this mmap-associated penalty is removed, the applications can save
    :some memory by not using the BUFSIZ (or bigger) buffers, and the
    :systems can save the time and effort of shuffling the memory from
    :kernel buffers into user space (and flushing the instruction and data
    :caches). The difference can be big -- on a CPU bound machine the sum
    :of user time and system time is much smaller with mmap. For example,
    :on this Solaris box running on Sparc-900MHz md5-ing a 16061698048-byte
    :file (FreeBSD behaves similarly on the P2 400MHz reported earlier):
    : mmap: 215.290u 48.990s 7:18.81 60.2% 0+0k 0+0io 0pf+0w
    : read: 184.240u 142.350s 5:46.31 94.3% 0+0k 0+0io 0pf+0w
    : (264.28 vs. 326.59 CPU seconds)
    :but read manages to saturate the CPU better -- 94% vs. 60% -- and win
    :the "wall clock" race repeatedly...
    : -mi

        I think this points to inefficiencies in NFS's getpages() interface over
        its read() interface. The read() interface (for NFS) definitely has better
        read-ahead characteristics. The NFS getpages() interface in FreeBSD
        is about as primitive as it is possible to make it and still work, and
        it's only marginally better in DragonFly (we get rid of some KVM allocations
        and deallocations). In fact, I don't even think the NFS getpages interface
        uses the IOD's like the read interface does. I think it might actually be
        a synchronous interface.

        It would be nice if someone were to improve the NFS getpages interface.
        I might do it myself, if I can find the time down the road.

                                            Matthew Dillon