Interesting cross-fertilization with DfBSD

From: Andre Oppermann (andre_at_freebsd.org)
Date: 03/31/04

  • Next message: Jacques A. Vidrine: "Re: Last NSS commit is very dangerous"
    Date: Wed, 31 Mar 2004 20:23:52 +0200
    To: current@freebsd.org

    attached mail follows:


    Date: Tue, 30 Mar 2004 14:58:13 -0800 (PST)
    
    

        The recent PIPE work adapted from Alan Cox's work in FreeBSD-5 has really
        lit a fire under my seat. It's amazing how such a simple concept can
        change the world as we know it :-)

        Originally the writer side of the PIPE code was mapping the supplied
        user data into KVM and then signalling the reader side. The reader
        side would then copy the data out of KVM.

        The concept Alan codified is quite different: Instead of having the
        originator map the data into KVM, simply supply an array of vm_page_t's
        to the target and let the target map the data into KVM. In the case
        of the PIPE code, Alan used the SF_BUF API (which was originally developed
        by David Greenman for the sendfile() implementation) on the target side
        to handle the KVA mappings.
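The hand-off described above can be modelled in user space. This is only a minimal sketch under stated assumptions, not the FreeBSD or DragonFly pipe code: `struct sim_page` stands in for `vm_page_t`, `reader_map()` stands in for an SF_BUF-style temporary mapping, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096
#define SIM_MAX_PAGES 4

/* Hypothetical stand-in for vm_page_t: just one page of memory. */
struct sim_page { unsigned char data[SIM_PAGE_SIZE]; };

/* What the writer hands over: page references only, no mapping. */
struct page_handoff {
    struct sim_page *pages[SIM_MAX_PAGES];
    int npages;
};

static int reader_mappings;   /* counts mappings made on the target side */

/* Stand-in for a temporary, target-side mapping of one page. */
static unsigned char *reader_map(struct sim_page *pg)
{
    reader_mappings++;
    return pg->data;
}

/* Writer side: record which pages hold the data and map nothing. */
static void writer_send(struct page_handoff *h, struct sim_page *pgs, int n)
{
    assert(n >= 0 && n <= SIM_MAX_PAGES);
    for (int i = 0; i < n; i++)
        h->pages[i] = &pgs[i];
    h->npages = n;
}

/* Reader side: map each page only when it is about to be copied out. */
static size_t reader_receive(const struct page_handoff *h,
                             unsigned char *dst, size_t len)
{
    size_t done = 0;
    for (int i = 0; i < h->npages && done < len; i++) {
        size_t n = len - done < SIM_PAGE_SIZE ? len - done : SIM_PAGE_SIZE;
        memcpy(dst + done, reader_map(h->pages[i]), n);
        done += n;
    }
    return done;
}
```

In this model the writer never creates a mapping; all mapping work happens in `reader_receive()`, which is the side that actually needs to touch the bytes.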

        Seems simple, eh? But Alan got an unexpectedly huge boost in performance
        on IA32 when he did this. The performance boost turned out to be due
        to several facts:

            * Avoiding the KVM mappings and the related kernel_object manipulations
              required for those mappings saves a lot of cpu cycles when all you
              want is a quick mapping into KVM.

            * On SMP, KVM mappings generated IPIs to all cpus in order to
              invalidate the TLB. By avoiding KVM mappings all of those IPIs
              go away.

            * When the target maps the page, it can often get away with doing
              a simple localized cpu_invlpg(). Most targets will NEVER HAVE TO
              SEND IPIs TO OTHER CPUS. The current SF_BUF implementation still
              does send IPIs in the uncached case, but I had an idea to fix that
              and Alan agrees that it is sound... and that is to store a cpumask
              in the sf_buf so a user of the sf_buf only invalidates the cached
              KVM mapping if it had not yet been accessed on that particular cpu.

            * For PIPEs, the fact that SF_BUF's cached their KVM mappings
              reduced the mapping overhead almost to zero.
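The cpumask idea from the third point can be sketched as a small user-space model. It assumes at most 32 cpus and uses a counter in place of a real `cpu_invlpg()`; `struct sim_sf_buf` and both functions are hypothetical names, not the actual SF_BUF API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one cached sf_buf mapping slot. */
struct sim_sf_buf {
    const void *page;    /* page currently mapped at this slot */
    uint32_t    cpumask; /* bit n set: cpu n has already synced its view */
};

static int local_invalidations;  /* stands in for calls to cpu_invlpg() */

/* Point the slot at a new page. No IPI is sent: the mask is simply
 * cleared, so each cpu invalidates lazily on its next use. */
static void sim_sf_buf_rebind(struct sim_sf_buf *sf, const void *page)
{
    sf->page = page;
    sf->cpumask = 0;
}

/* Called by a cpu before it touches the mapping. Only a cpu that has
 * not yet used this binding pays for a local invalidation. */
static void sim_sf_buf_use(struct sim_sf_buf *sf, int cpu)
{
    assert(cpu >= 0 && cpu < 32);
    uint32_t bit = (uint32_t)1 << cpu;
    if ((sf->cpumask & bit) == 0) {
        local_invalidations++;
        sf->cpumask |= bit;
    }
}
```

Under these assumptions, repeated use on the same cpu costs nothing after the first access, and other cpus are never interrupted; each one pays for a local invalidation only if and when it uses the mapping itself.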

        Now when I heard about this huge performance increase I of course
        immediately decided that DragonFly needed this feature too, and so we
        now have it for DFly pipes.

                            Light Bulb goes off in head

        But it also got me to thinking about a number of other sticky issues
        that we face, especially in our desire to thread major subsystems (such
        as Jeff's threading of the network stack and my desire to thread VFS),
        and also issues related to how to efficiently pass data between threads,
        and how to efficiently pass data down through the I/O subsystem.

        Until now, I think everyone here and in FreeBSD land was stuck on the
        concept of the originator mapping the data into KVM instead of the
        target for most things. But Alan's work has changed all that.

        This idea of using SF_BUF's and making the target responsible for mapping
        the data has changed everything. Consider what this can be used for:

        * For threaded VFS we can change the UIO API to a new API (I'll call it
          XIO) which passes an array of vm_page_t's instead of a user process
          pointer and userspace buffer pointer.

          So 'XIO' would basically be our implementation of target-side mappings
          with SF_BUF capabilities.

        * We can do away with KVM mappings in the buffer cache for the most
          prevalent buffers we cache... those representing file data blocks.
          We still need them for meta-data, and a few other circumstances, but
          the KVM load on the system from buffer cache would drop by like 90%.

        * We can use the new XIO interface for all block data references from
          userland and get rid of the whole UIO_USERSPACE / UIO_SYSSPACE mess.
          (I'm gunning to get rid of UIO entirely, in fact).

        * We can use the new XIO interface for the entire I/O path all the way
          down to busdma, yet still retain the option to map the data if/when
          we need to. I never liked the BIO code in FreeBSD-5; this new XIO
          concept is far superior and will solve the problem neatly in DragonFly.

        * We can eventually use XIO and SF_BUF's to codify copy-on-write at
          the vm_page_t level and no longer stall memory modifications to I/O
          buffers during I/O writes.

        * I will be able to use XIO for our message passing IPC (our CAPS code),
          making it much, much faster than it currently is. I may do that as
          a second step to prove-out the first step (which is for me to create
          the XIO API).

        * Once we have vm_page_t copy-on-write we can recode zero-copy TCP
          to use XIO, and it won't be a hack any more.

        * XIO fits perfectly into the eventual pie-in-the-sky goal of
          implementing SSI/Clustering, because it means we can pass data
          references (vm_page_t equivalents) between machines instead of
          passing the data itself, and only actually copy the data across
          on the final target. e.g. if on an SSI system you were to do
          'cp file1 file2', and both file1 and file2 are on the same filesystem,
          the actual *data* transfer might only occur on the machine housing
          the physical filesystem and not on the machine doing the 'cp'. Not
          one byte. Can you imagine how fast that would be?
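Since the XIO API did not exist yet at the time of this message, the following is only a guess at the shape of such a descriptor: an array of page references plus a byte offset and length, with the consumer doing the copy. `struct sim_xio`, its fields, and `sim_xio_copy_out()` are hypothetical names chosen for this user-space sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096

/* Hypothetical stand-in for vm_page_t. */
struct sim_page { unsigned char data[SIM_PAGE_SIZE]; };

/* Hypothetical XIO-style descriptor: pages plus a byte window. */
struct sim_xio {
    struct sim_page **pages;  /* page references supplied by the originator */
    int               npages;
    size_t            offset; /* where the data starts in the first page */
    size_t            bytes;  /* total valid bytes described */
};

/* Copy up to len bytes, starting uoff bytes into the described data,
 * into dst. Returns the number of bytes copied. */
static size_t sim_xio_copy_out(const struct sim_xio *xio, size_t uoff,
                               void *dst, size_t len)
{
    unsigned char *out = dst;
    size_t done = 0;

    if (uoff >= xio->bytes)
        return 0;
    if (len > xio->bytes - uoff)
        len = xio->bytes - uoff;

    size_t pos = xio->offset + uoff;
    while (done < len) {
        size_t idx  = pos / SIM_PAGE_SIZE;   /* which page */
        size_t poff = pos % SIM_PAGE_SIZE;   /* offset within that page */
        size_t n    = SIM_PAGE_SIZE - poff;  /* bytes left in that page */
        if (n > len - done)
            n = len - done;
        assert(idx < (size_t)xio->npages);
        memcpy(out + done, xio->pages[idx]->data + poff, n);
        done += n;
        pos  += n;
    }
    return done;
}
```

Because the descriptor carries only references, it can be passed between subsystems without anyone mapping or copying the data until a consumer calls something like `sim_xio_copy_out()`.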

        And many other things. XIO is the nutcracker, and the nut is virtually
        all the remaining big-ticket items we need to cover in DragonFly.

        This is very exciting to me.

                                                        -Matt

    
    

    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

