Re: read vs. mmap (or io vs. page faults)
From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: Mon, 21 Jun 2004 17:15:09 -0700 (PDT)
To: Mikhail Teterin <Mikhail.Teterin@Murex.com>
:The mmap interface is supposed to be more efficient -- theoretically --
:because it requires one less buffer-copying, and because it (together
:with the possible madvise()) provides the kernel with more information
:thus enabling it to make better (at least -- no worse) decisions.
Well, I think you forgot my earlier explanation regarding buffer copying.
Buffer copying is a very cheap operation if it occurs within the L1 or
L2 cache, and that is precisely what is happening when you read() into
a fixed buffer in a loop in a C program... your buffer is fixed in
memory and is almost guaranteed to be in the L1/L2 cache, which means
that the extra copy operation is very fast on a modern processor. It's
something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
example, and 3 GBytes/sec uncached to main memory.
Consider the cpu time cost, then, of the local copy on a 2GB file...
the cpu time cost on an AMD64 is about 2/12 of one second. This is
the number mmap would have to beat.
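The 2/12 figure is just the file size divided by the copy bandwidth. A minimal sketch of that arithmetic follows; the helper name copy_seconds is made up for illustration, and the rates are the rough ones quoted above, not measurements.

```c
/*
 * Estimated time, in seconds, to copy `bytes` bytes at a sustained rate of
 * `bytes_per_sec`.  Hypothetical helper, only for the back-of-envelope math.
 */
double copy_seconds(double bytes, double bytes_per_sec)
{
    return bytes / bytes_per_sec;
}
```

With the figures above, copy_seconds(2e9, 12e9) comes to roughly 0.17 seconds for an L1-resident buffer, and copy_seconds(2e9, 3e9) to roughly 0.67 seconds if every copy went uncached to main memory.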
As you can see by your timing results, even on your fastest box,
processing a file around that size is only going to incur 1-2 seconds
of real time overhead to do the extra buffer copy. 2 seconds is a hard
number to beat.
This is something you can calculate yourself. Time a dd from /dev/zero
to /dev/null:
crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)
amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)
amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)
Try it for different buffer sizes (16K through 16MB) and you will get
a feel for how the L1 and L2 caches affect copying bandwidth. These
numbers are reasonably close to the raw memory bandwidth available to
the cpu (and will be different depending on whether the buffer fits in
the L1 or L2 caches, or doesn't fit at all).
The mmap interface is not supposed to be more efficient, per se. Why
would it be? There are overheads involved with mapping the page table
entries and taking faults to map more. Even if you pre-mapped everything,
there are still overheads involved in populating the page table and
performing invlpg operations on the TLB to reload the entry, and for
large data sets there is overhead involved with removing page table
entries and invalidating the pte. On a modern cpu, where an L1 cache
copy is a two cycle streaming operation, the several hundred (or more)
cycles it takes to process a page fault or even just populate the
page table is equivalent to a lot of copied bytes.
This immediately puts mmap() at a disadvantage on a modern cpu, but of
course it also depends on what the data processing loop itself is
doing. If the data processing loop is sensitive to the L1 cache then
processing larger chunks of data is going to make it more efficient,
and mmap() can certainly provide that where read() might require buffers
too large to fit comfortably in the L1/L2 cache. On the other hand, if
the processing loop is relatively insensitive to the L1 cache (i.e. it's
small), then you can afford to process the data in smaller chunks, like
16K, without any significant penalty.
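A minimal sketch of the kind of read() loop being described: one fixed 16K buffer reused for every chunk, with a trivial processing step (a byte sum) standing in for the real work. The name sum_file_read and the CHUNK constant are illustrative assumptions; retrying on EINTR is left out for brevity.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define CHUNK (16 * 1024)   /* 16K: small enough to stay cache-resident */

/*
 * Sum every byte of the file at `path` with a read() loop over one fixed
 * buffer.  Stores the sum in *out and returns 0, or returns -1 on error.
 */
int sum_file_read(const char *path, uint64_t *out)
{
    static unsigned char buf[CHUNK];  /* same memory reused for every chunk */
    uint64_t sum = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];            /* stand-in for the processing loop */
    }
    close(fd);
    if (n < 0)                        /* read() failed part way through */
        return -1;
    *out = sum;
    return 0;
}
```

Because buf never moves, the kernel-to-user copy keeps landing in the same cache lines, which is the effect the L1/L2 argument relies on.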
mmap() is not designed to streamline large demand-page reads of data
sets much larger than main memory. mmap() works best for data that
is already cached in the kernel, and even then it still has a fairly
large hurdle to overcome vs a streaming read(). This is a HARDWARE
limitation. Drastic action would have to be taken in software to get
rid of this overhead (we'd have to use 4MB page table entries, which
come with their own problems).
The overhead required to manage a large mmap'd data set can skyrocket.
FreeBSD (and DragonFly) have heuristics that attempt to detect
sequential operations like this with mmap'd data and to depress the
page priority behind the read (so: read-ahead and depress-behind), and
this works, but it only mitigates the additional overhead some, it
doesn't get rid of it.
For linear processing of large data sets you almost universally want
to use a read() loop. There's no good reason to use mmap().
:=: read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w
:Well, now we are venturing into the domain of humans' subjective
:perception... I'd say, 12% is plenty, actually. This is what some people
:achieve by rewriting stuff in assembler -- and are proud, when it works
Nobody is going to stare at their screen for one minute and 17 seconds
and really care that something might take one minute and 27 seconds instead
of one minute and 17 seconds. That's subjective truth.
The type of test you want to do is this:
[start timing]
[read all data into memory]
[stop timing] -> print timing results
[start timing]
[process all data]
[stop timing] -> print timing results
Now you have something practical you can look at... you can look at the
I/O bandwidth required to bring the data into memory without the
complications of whatever processing you are doing on the data being
mixed in. *THEN* you can say something more definitive about the
kernel overhead required to get the data into memory first, because
you can definitely say what the 'bandwidth', or data rate, has been
achieved in getting the data from the disk or kernel caches into
your program's memory space (faulted in and everything, ready to access).
You could then compare that to the times required to do it in a mixed
environment (read-processing loop). If *THOSE* numbers are hugely
different then you can say something definitive about the relative
efficiency of the mixed mode processing versus just doing pure I/O,
for both read() and mmap() independently.
:Put it into perspective -- 10-15% is usually the difference between
:the latest processor and the previous one. People are willing to pay
:hundreds of dollars premium...
15% is nothing anyone cares about except perhaps gamers. I certainly
couldn't care less about 15%. 50%, on the other hand, is something
that I would care about. But upgrading isn't just a function of raw
cpu speed, it's also a function of general improvements in hardware
and hardware interfaces... usb, usb2, firewire, sata, and so forth.
:Besides, the differences can be higher. Here is from md5-ing a
:2097272832-bytes file over NFS (on a Gigabit network, no jumbo frames).
:The machine runs a FreeBSD-current on a single P4 2GHz:
: mmap1: 17.115u 16.106s 2:20.84 23.5% 5+166k 0+0io 253421pf+0w
: read1: 19.468u 12.179s 1:27.80 36.0% 4+163k 0+0io 0pf+0w
: mmap2: 17.214u 13.265s 2:13.75 22.7% 5+165k 1+0io 204842pf+0w
: read2: 19.142u 11.576s 1:20.22 38.2% 4+162k 0+0io 4pf+0w
:mmap is 87% slower (or read is 38% faster)! According to `systat -if',
:mmap was reading at about 13Mb/s, while read was consistently above
:If this mmap-associated penalty is removed, the applications can save
:some memory by not using the BUFSIZ (or bigger) buffers, and the
:systems can save the time and effort of shuffling the memory from
:kernel buffers into user space (and flushing the instruction and data
:caches). The difference can be big -- on a CPU bound machine the sum
:of user time and system time is much smaller with mmap. For example,
:on this Solaris box running on Sparc-900MHz md5-ing a 16061698048-byte
:file (FreeBSD behaves similarly on the P2 400MHz reported earlier):
: mmap: 215.290u 48.990s 7:18.81 60.2% 0+0k 0+0io 0pf+0w
: read: 184.240u 142.350s 5:46.31 94.3% 0+0k 0+0io 0pf+0w
: (264.28 vs. 326.59 CPU seconds)
:but read manages to saturate the CPU better -- 94% vs. 60% -- and win
:the "wall clock" race repeatedly...
I think this points to inefficiencies in NFS's getpages() interface over
its read() interface. The read() interface (for NFS) definitely has better
read-ahead characteristics. The NFS getpages() interface in FreeBSD
is about as primitive as it is possible to make it and still work, and
it's only marginally better in DragonFly (we get rid of some KVM allocations
and deallocations). In fact, I don't even think the NFS getpages interface
uses the IOD's like the read interface does. I think it might actually be
a synchronous interface.
It would be nice if someone were to improve the NFS getpages interface.
I might do it myself, if I can find the time down the road.