Re: read vs. mmap (or io vs. page faults)

From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: 06/22/04

    Date: Tue, 22 Jun 2004 12:09:10 -0700 (PDT)
    To: Mikhail Teterin <mi+kde@aldan.algebra.com>
    
    

        (current@ removed, but I'm leaving this on questions@ since it contains
        some useful information).

    :This is, sort of, self-perpetuating -- as long as mmap is slower/less
    :reliable, applications will be hesitant to use it, thus there will be
    :little incentive to improve it. :-(

        Well, again, this is an incorrect perception. Your use of mmap() to
        process huge linear data sets is not what mmap() is best at doing, on
        *any* operating system, and not what people use mmap() for most of the
        time. There are major hardware related overheads to the use of mmap(),
        on *ANY* operating system, that cannot be circumvented. You have no
        choice but to allocate the pages for a page table, to populate the pages
        with pte's, you must invalidate the pages in the tlb whenever you modify
        a page table entry (e.g. invlpg instruction for IA32, which on a P2 is
        extremely expensive), and if you are processing huge data sets you also
        have to remove the page table entry from the page table when the
        underlying data page is reused due to the dataset being larger than
        main memory. There are overheads related to each of these issues, and
        overheads related to the algorithms the operating system *MUST* use to
        figure out which pages to remove (on the fly) when the data set does
        not fit in main memory, and there are overheads related to the heuristics
        the operating system employs to try to predict the memory usage pattern
        to perform some read-ahead.
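
        As a rough illustration of the bookkeeping described above, the
        sketch below counts the page table entries a large mapping needs.
        The 4 KiB page size, 4-byte pte and 1024 ptes per page-table page
        are assumptions matching classic (non-PAE) IA32; other platforms
        differ, and the function names are made up for this example.

```python
# Rough page-table arithmetic for a large mmap()'d file.
# Assumptions (classic non-PAE IA32): 4 KiB pages, 4-byte ptes,
# so one page-table page holds 1024 ptes.
PAGE_SIZE = 4096
PTE_SIZE = 4
PTES_PER_TABLE_PAGE = PAGE_SIZE // PTE_SIZE


def pte_count(mapping_bytes):
    """Number of ptes needed to map mapping_bytes (rounded up to whole pages)."""
    return (mapping_bytes + PAGE_SIZE - 1) // PAGE_SIZE


def page_table_pages(mapping_bytes):
    """Number of page-table pages needed to hold those ptes."""
    ptes = pte_count(mapping_bytes)
    return (ptes + PTES_PER_TABLE_PAGE - 1) // PTES_PER_TABLE_PAGE


if __name__ == "__main__":
    one_gib = 1 << 30
    print(pte_count(one_gib), "ptes in", page_table_pages(one_gib), "page-table pages")
```

        Each of those ptes has to be populated when the page is first
        touched, and invalidated and removed again when the underlying page
        is reused.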

        These are hardware and software issues that cannot simply be wished away.
        No matter how much you want the concept of memory mapping to be 'free',
        it isn't. Memory mapping and management are complex operations for
        any operating system, always have been, and always will be.

    :I'd rather call attention to my slower -- CPU-bound boxes. On them, the
    :total CPU time spent computing md5 of a file is less with mmap -- by a
    :noticeable margin. But because the CPU is underutilized, the elapsed "wall
    :clock" time is higher.
    :
    :As far as the cache-using statistics, having to do a cache-cache copy
    :doubles the cache used, stealing it from other processes/kernel tasks.

        But it is also not relevant for this case because the L2 cache is
        typically much larger (128K-2MB) than the 8-32K you might use for
        your local buffer. What you are complaining about here is going
        to wind up being mere microseconds over a multi-minute run.

        It's really important, and I can't stress this enough, to not simply
        assume what the performance impact of a particular operation will be
        by the way it feels to you. Your assumptions are all skewed... you
        are assuming that copying is always bad (it isn't), that copying is
        always horrendously expensive (it isn't), that memory mapping is always
        cheap (it isn't cheap), and that a small bit of cache pollution will have
        a huge penalty in time (it doesn't necessarily, certainly not for a
        reasonably sized user buffer).

        I've already told you how to measure these things. Do me a favor and just
        run this dd on all of your FreeBSD boxes:

        dd if=/dev/zero of=/dev/null bs=32k count=8192

        The resulting bytes/sec that it reports is a good guesstimate of the
        cost of a memory copy (the actual copy rate will be faster since the
        times include the read and write system calls, but it's still a reasonable
        basis). So in the case of my absolute fastest machine
        (an AMD64 3200+ tweaked up a bit):

        268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)

        That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes
        of data. On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent
        to a P2/400MHz):

        268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)

        So the cost is 1 second to copy 680 MBytes of data on my slowest box.
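
        The arithmetic behind those two figures can be sketched as follows.
        The byte counts and timings are the dd results quoted above; the
        10 GB dataset size and the function names are made up for
        illustration only.

```python
# Convert a dd report into "cpu seconds per amount of data copied".
# The two measurements are the ones quoted above; the 10 GB dataset
# size is a hypothetical example.

def bytes_per_sec(nbytes, seconds):
    """Throughput implied by a dd report line."""
    return nbytes / seconds


def copy_seconds(dataset_bytes, rate):
    """Approximate cpu time to copy dataset_bytes once at the given rate."""
    return dataset_bytes / rate


if __name__ == "__main__":
    fast = bytes_per_sec(268435456, 0.058354)   # AMD64 3200+
    slow = bytes_per_sec(268435456, 0.394222)   # VIA C3 Samuel 2
    dataset = 10 * 10**9                        # hypothetical 10 GB file
    for name, rate in (("fast", fast), ("slow", slow)):
        print(f"{name}: {rate / 1e6:.0f} MB/s, "
              f"{copy_seconds(dataset, rate):.1f} s per copy of the dataset")
```

        As noted above, these rates include the read and write system
        calls, so they overstate the cost of the copy itself.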

    :Here, again, is from my first comparison on the P2 400MHz:
    :
    : stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
    : mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w

        Well, the cpu utilization is only 71.8% for the read case, so the box
        is obviously I/O bound already.
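
        The utilization column in that time output is just
        (user + system) / elapsed. A small sketch that recomputes it from
        the numbers quoted above (the helper names are made up for this
        example):

```python
# Recompute the cpu utilization column of csh's time output:
# (user + system) / elapsed.

def elapsed_seconds(text):
    """Convert csh-style elapsed time ('2:06.61' or '1:02:03.4') to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_utilization(user, system, elapsed):
    """Percentage of wall-clock time spent on the cpu."""
    return 100.0 * (user + system) / elapsed_seconds(elapsed)


if __name__ == "__main__":
    print(f"stdio: {cpu_utilization(56.837, 34.115, '2:06.61'):.1f}%")  # 71.8%
    print(f"mmap:  {cpu_utilization(72.463, 7.534, '2:34.62'):.1f}%")   # 51.7%
```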

        The real question you should be asking is not why mmap is only using
        51.7% of the cpu, but why stdio is only using 71.8% of the cpu. If
        you want to make your processing program more efficient, 'fix' stdio
        first. You need to:

        (1) Figure out the rate at which your processing program reads data in
            the best case. You can do this by timing it on a data set that fits
            in memory (so no disk I/O is done). Note that it might be bursty,
            so the average rate alone does not precisely quantify the amount of
            buffering that will be needed.

        (2) If your hard drive is faster than the datarate, then determine if
            the overhead of doing double-buffering is worth keeping the
            processing program populated with data on demand. The overhead
            of doing double buffering is something akin to:

            dd if=file bs=1m | dd bs=32k > /dev/null

        (3) Figure out how much buffering is required to keep the processing
            program supplied with data (achieving either 100% cpu utilization or
            100% I/O utilization).

            #!/bin/csh
            #
            dd if=file bs=1m | dd bs=32k | your_processing_program
                       ^^^^^      ^^^^^^
                       try different buffer sizes to try to achieve 100% cpu
                       utilization or 100% I/O utilization on the drive.

            time ./scriptfile

        (4) If this overhead is small enough (less than the ~28% of available cpu
            you have in the stdio case), then you can use it to front-end your
            processing script and achieve an improvement, despite the extra
            copying that it does.

            (Again, in my last email I gave you the 'dd' lines that you can use
            to determine exactly what the copying overhead for a dataset would be,
            and gave you live examples showing that, usually, it's quite small
            compared to the total run time of a typical processing program).

        Don't just assume that copying is bad, or that extra stages are bad,
        because the reality is that they might not be in an I/O bound situation.
        You have to measure the actual overhead to see what the actual cost is.
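
        For readers who want to see what one of those dd stages amounts to,
        here is a minimal sketch of the copy loop: read large blocks, hand
        them on in smaller writes. It is not a replacement for dd; the
        function name and default sizes are assumptions for illustration,
        and the actual buffering between stages comes from running each
        stage as its own process connected by pipes.

```python
# A minimal stand-in for one "dd" stage in the pipeline: read large
# blocks from src and pass them on to dst in smaller writes.
# src and dst are binary file-like objects (for example
# sys.stdin.buffer and sys.stdout.buffer).

def rebuffer(src, dst, in_bs=1 << 20, out_bs=32 * 1024):
    """Copy src to dst, reading in_bs bytes at a time and writing out_bs chunks.

    Returns the number of bytes copied.
    """
    total = 0
    while True:
        block = src.read(in_bs)
        if not block:               # empty read means end of input
            break
        view = memoryview(block)    # slice without making extra copies
        for off in range(0, len(view), out_bs):
            dst.write(view[off:off + out_bs])
        total += len(block)
    return total
```

        Timing such a stage on its own, with different in_bs and out_bs
        values, gives the copying overhead that step (4) asks you to
        compare against the idle cpu.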

        My backup script uses dd to double buffer for precisely this reason,
        though in my case I do it because 'dump' output is quite bursty and
        sometimes it blocks waiting for gzip when, really, it shouldn't have to.
        Here is a section out of my backup script:

            ssh $host -l operator $sshopts "dump ${level}auCbf 32 64 - $i" | \
                    dd obs=1m | dd obs=1m | gzip -6 > $file.tmp

        I would never, ever expect the operating system to buffer that much
        data ahead of a program, nor should the OS do that, so I do it myself.
        The cost is a pittance. I waste 1% of the cpu in order to gain about
        18% in real time by allowing dump to more fully utilize the disk it is
        dumping.

    :Or is P2 400MHz not modern? Maybe, but the very modern Sparcs, on which
    :FreeBSD intends to run are not much faster.

        A 400 MHz P2 is 1/3 as fast as the LOWEST END AMD XP cpu you can buy
        today, and 5-10 times slower than higher-end Intel and AMD cpus.
        I would say that that makes it 'not modern'.

        We aren't talking 15% here. We are talking 300%-1000%.

    := The mmap interface is not supposed to be more efficient, per se.
    := Why would it be?
    :
    :Puzzling question. Because the kernel is supplied with more information
    :-- it knows, that I only plan to _read_ from the memory (PROT_READ),
    :the total size of what I plan to read (mmap's len, optionally,
    :madvise's len), and (optionally), that I plan to read sequentially
    :(MADV_SEQUENTIAL).

        Well, this is not correct. The kernel has just as much information
        when you use read().

        Furthermore, you are making the assumption that the kernel should
        read-ahead an arbitrary amount of data. It could very well be that
        the burstiness of your processing program requires a megabyte or more
        worth of read-ahead to keep the cpu saturated.

        The kernel will never do this, because dedicating that much memory to
        a single I/O stream is virtually guaranteed to be detrimental to the
        rest of the system (everything else running on the system).

        The kernel will not do this, but you certainly can, either by
        double-buffering the stream or by following Julian's excellent suggestion
        to fork() a helper thread to read that far ahead.
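
        A minimal sketch of the helper approach, assuming a thread and a
        bounded queue rather than a fork()ed process. The function names,
        the 32K block size and the 64-block depth (about 2 MB of
        read-ahead) are illustrative choices, not values from this thread;
        md5 stands in for the processing program.

```python
# Read-ahead with a helper thread: the reader fills a bounded queue
# while the consumer processes blocks. Block size and depth are
# illustrative; tune them by measurement.
import hashlib
import queue
import threading


def read_ahead(path, block_size=32 * 1024, depth=64):
    """Yield blocks of path while a helper thread reads up to depth blocks ahead."""
    blocks = queue.Queue(maxsize=depth)

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    block = f.read(block_size)
                    blocks.put(block)      # waits while the queue is full
                    if not block:          # empty bytes marks end of file
                        return
        except OSError as exc:
            blocks.put(exc)                # hand I/O errors to the consumer

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = blocks.get()
        if isinstance(item, OSError):
            raise item
        if not item:
            return
        yield item


def md5_with_read_ahead(path):
    """Example consumer: md5 of a file fed by the read-ahead helper."""
    digest = hashlib.md5()
    for block in read_ahead(path):
        digest.update(block)
    return digest.hexdigest()
```

        The memory dedicated to the stream is bounded by
        block_size * depth, and it is the application, not the kernel,
        that decides to spend it.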

    :Mmap also needs no CPU data-cache to read. If the device is capable of
    :writing to memory directly (DMA?), the CPU does not need to be involved
    :at all, while with read the data still has to go from the DMA-filled
    :kernel buffer to the application buffer -- there being two copies of it
    :in cache instead of none for just storing or one copy for processing.

        In most cases the CPU is not involved at all when you mmap() data until
        you access it via the mmap(). However, that does not mean that the memory
        subsystem is not involved. The CPU must still load the data you access
        into the L1/L2 caches from main memory when you access it, so the memory
        overhead is still there and still (typically) 5 times greater than the
        additional memory overhead required to do a buffer copy in the read()
        case. When you add in the overhead of processing the data, which is
        typically 10-50 times the cost of reading it in the first place, then
        the 'waste' from the extra buffer copy winds up being in the noise.

        So, as I said in my previous email, it comes down to how much it costs
        to do a local copy within the L2 cache (the read() case), versus how
        much extra overhead is involved in the mmap case. And, as I stated
        previously, L1 and L2 cache bandwidth is so high these days that it
        really doesn't take all that much overhead to match (and then exceed)
        the time it takes to do the local copy.
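
        To measure the two cases side by side rather than guess, something
        like the following sketch can be used. It computes an md5 either
        through read() into one small reused buffer or through a read-only
        mapping. The function names and the 32K default are assumptions
        for illustration, the madvise() hint is only applied where the
        platform's Python exposes it, and mapping an empty file raises an
        error.

```python
# md5 of a file via read() into a small reused buffer versus via mmap().
import hashlib
import mmap
import time


def md5_read(path, bufsize=32 * 1024):
    """md5 via read() into one small buffer that is reused for every block."""
    digest = hashlib.md5()
    buf = bytearray(bufsize)
    view = memoryview(buf)
    with open(path, "rb", buffering=0) as f:    # unbuffered: one read() per call
        while True:
            n = f.readinto(buf)
            if not n:
                break
            digest.update(view[:n])
    return digest.hexdigest()


def md5_mmap(path):
    """md5 via a read-only mapping of the whole (non-empty) file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Sequential-access hint, where available (Python 3.8+).
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            digest.update(m)
    return digest.hexdigest()


def timed(func, path):
    """Return (wall seconds, cpu seconds, digest) for one run of func(path)."""
    wall, cpu = time.perf_counter(), time.process_time()
    result = func(path)
    return time.perf_counter() - wall, time.process_time() - cpu, result
```

        Running timed() on both functions, once against a file that fits in
        memory and once against one that does not, gives the wall-clock and
        cpu figures being argued about here; no particular outcome is
        implied by the sketch itself.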

    :Also, in case of RAM shortage, mmap-ed pages can be just dropped, while
    :the too large buffer needs to be written into swap.

        Huh? No, that isn't true. Your too-large buffer might still only be
        a megabyte, whereas your mmap()'d data might be a gigabyte. Since you
        are utilizing the buffer over and over again its pages are NOT likely
        to ever be written to swap.

    :And mmap requires no application buffers -- win, win, and win. Is there
    :an inherent "lose" somewhere, I don't see? Like:

        Again, you aren't listening to what I said about how the L1/L2 cache
        works. You really have to listen. APPLICATION BUFFERS WHICH EASILY
        FIT IN THE L2 CACHE COST VIRTUALLY NOTHING ON A MODERN CPU! I even
        gave you a 'dd' test you could perform on FreeBSD to measure the cost.
        It is almost impossible to beat 'virtually nothing'.

    :A database, that returns results 15%, nay, even 5% faster is also a
    :better database.
    :...
    :What are we arguing about? Who wouldn't take a 2.2GHz processor over a
    :2GHz one -- other things being equal -- and they are?
    :..
    : -mi

        Which is part of the problem. You are not taking into account cost
        considerations when you say that. You are paying a premium to buy
        a cpu that is only 15% faster. If it were free, or cost a pittance,
        I would take the 2.2GHz cpu. But it isn't free, and for a high-end cpu
        15% can be $400 (or more), which is why it generally isn't worth it for
        a mere 15%. The money can be spent on other things that are just
        as important: memory, another disk (double your disk throughput),
        GigE network card, even a whole new machine so you now have two
        slightly slower machines (200%) rather than one slightly faster machine
        (115%).

                                            -Matt
                                            Matthew Dillon
                                            <dillon@backplane.com>

    _______________________________________________
    freebsd-questions@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-questions
    To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"

