Re: Network Stack Locking

From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: 05/21/04

  • Next message: M. Warner Losh: "Re: atomic reference counting primatives."
    Date: Thu, 20 May 2004 18:03:26 -0700 (PDT)
    To: Robert Watson <rwatson@freebsd.org>
    
    

        It's my guess that we will be able to remove the BGL from large
        portions of the DFly network stack sometime late June or early July,
        after USENIX, at which point it will be possible to test SMP aspects of
        the localized cpu distribution method. Right now the network stack is
        still under the BGL (as is most of the system; our approach to MP is
        first to isolate and localize the conflicting subsystems, then to release
        the BGL for that subsystem's thread(s)).

        It should be noted that the biggest advantages of the distributed
        approach are (1) the ability to operate on individual PCBs without
        having to do any token/mutex/other locking at all, (2) cpu locality
        of reference with regard to cache mastership of the PCBs and related
        data, and (3) avoidance of data cache pollution across cpus (more
        cpus == better utilization of individual L1/L2 caches and far greater
        scalability). The biggest disadvantage is the mandatory thread switch
        (but this is mitigated as load increases since each thread can work on
        several PCBs without further switches, and because our thread scheduler
        is extremely lightweight under SMP conditions). Message passing
        overhead is very low since most operations already require some sort of
        roll-up structure to be passed (e.g. an mbuf in the case of the network).
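
        To make the ownership idea concrete, here is a minimal sketch in C.
        It is not DFly code: every name in it (conn_to_cpu, netmsg_send,
        proto_thread_run, the structures, NCPUS, QLEN) is hypothetical, the
        hash is an arbitrary choice, and the per-cpu queue is a plain array
        exercised from a single thread, so it only simulates the hand-off.
        The point it illustrates is that every packet of a connection maps
        to the same cpu, so the state owned by that cpu's thread is updated
        without any lock, and one wakeup can drain several queued messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPUS 4   /* hypothetical cpu count */
#define QLEN  16  /* hypothetical per-cpu queue depth */

/* Hypothetical connection identity: the TCP 4-tuple. */
struct conn_id {
    uint32_t laddr, faddr;
    uint16_t lport, fport;
};

/* Hypothetical roll-up message, standing in for an mbuf. */
struct netmsg {
    struct conn_id id;
    int payload_len;
};

/* One protocol thread per cpu; only that thread touches its state. */
struct proto_thread {
    struct netmsg queue[QLEN];
    size_t head, tail;
    int bytes_seen;  /* stands in for PCB state owned by this cpu */
};

struct proto_thread threads[NCPUS];

/* Map a connection to its owning cpu.  All packets of one connection
 * hash to the same cpu, so its PCB is never shared between cpus. */
unsigned
conn_to_cpu(const struct conn_id *id)
{
    uint32_t h = id->laddr ^ id->faddr ^
        (((uint32_t)id->lport << 16) | id->fport);

    h ^= h >> 16;
    return h % NCPUS;
}

/* Hand a message to the owning thread's queue.
 * Returns the target cpu, or -1 if that queue is full. */
int
netmsg_send(const struct netmsg *m)
{
    unsigned cpu = conn_to_cpu(&m->id);
    struct proto_thread *t = &threads[cpu];

    if (t->tail - t->head == QLEN)
        return -1;
    t->queue[t->tail % QLEN] = *m;
    t->tail++;
    return (int)cpu;
}

/* Body of the owning thread: drain everything queued in one pass.
 * No lock is taken because no other cpu updates bytes_seen. */
int
proto_thread_run(unsigned cpu)
{
    struct proto_thread *t = &threads[cpu];
    int handled = 0;

    while (t->head != t->tail) {
        struct netmsg *m = &t->queue[t->head % QLEN];

        t->bytes_seen += m->payload_len;
        t->head++;
        handled++;
    }
    return handled;
}
```

        In a real kernel the enqueue step would itself have to be a
        cross-cpu safe hand-off; the sketch leaves that part out.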

        We are running the full bore threaded, distributed network stack even
        on UP systems now (meaning: message passing and thread switching still
        occur even though there is only one target thread for a particular
        protocol). We have done fairly significant testing on GigE LANs and
        have not noticed any degradation in network performance, so we are
        certain we are on the right track.

        I do not expect cpu balancing to be all that big an issue, actually,
        especially given the typically short connection lifetimes that occur
        in these scenarios. But mutex avoidance is *REALLY* *HUGE* if you are
        processing a lot of TCP connections in parallel, because the quanta
        of work involved are so small.

        In any case, if you are seriously considering any sort of distributed
        methodology you should also consider formalizing a message passing
        API for FreeBSD. Even if you don't like our LWKT messaging API, I
        think you would love the DFly IPI messaging subsystem and it would be
        very easy to port as a first step. We use it so much now in DFly
        that I don't think I could live without it, e.g. for clock distribution,
        interrupt distribution, thread/cpu isolation, wakeup(), MP-safe messaging
        at higher levels (and hence packet routing), free()-return-to-
        originating-cpu (mutexless slab allocator), SMP MMU synchronization
        (the basic VM/pte-race issue with userland brought up by Alan Cox),
        basic scheduler operations, signal(), and the list goes on and on.
        In DFly, IPI messaging and message processing are required to be MP
        safe (it always occurs outside the BGL, like a cpu-localized fast
        interrupt), but a critical section still protects against reception
        processing so code that uses it can be made very clean.
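
        As an illustration of the free()-return-to-originating-cpu case and
        of a critical section holding off reception processing, here is a
        minimal single-threaded simulation in C. It is a sketch under
        assumptions, not the DFly implementation: all names (ipi_send,
        ipiq_process, sim_crit_enter, sim_crit_exit, obj_free, the
        structures and constants) are hypothetical, and a real IPI queue
        would also raise an interrupt on the target cpu and use a
        cross-cpu safe ring.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS    2  /* hypothetical cpu count */
#define IPIQ_LEN 8  /* hypothetical per-cpu IPI queue depth */

typedef void (*ipi_func_t)(void *arg);

struct ipi_msg {
    ipi_func_t func;
    void *arg;
};

struct cpu_state {
    struct ipi_msg ipiq[IPIQ_LEN];
    size_t rindex, windex;
    int crit_count;  /* > 0 while inside a critical section */
    int free_count;  /* allocator statistic, touched by the owner only */
};

struct cpu_state cpus[NCPUS];

/* Queue a function call for the target cpu.  Returns 0, or -1 if full. */
int
ipi_send(int target, ipi_func_t func, void *arg)
{
    struct cpu_state *c = &cpus[target];

    if (c->windex - c->rindex == IPIQ_LEN)
        return -1;
    c->ipiq[c->windex % IPIQ_LEN].func = func;
    c->ipiq[c->windex % IPIQ_LEN].arg = arg;
    c->windex++;
    return 0;
}

/* Run pending messages on their target cpu, unless that cpu is inside
 * a critical section, in which case processing is deferred. */
int
ipiq_process(int cpu)
{
    struct cpu_state *c = &cpus[cpu];
    int handled = 0;

    if (c->crit_count > 0)
        return 0;
    while (c->rindex != c->windex) {
        struct ipi_msg m = c->ipiq[c->rindex % IPIQ_LEN];

        c->rindex++;
        m.func(m.arg);
        handled++;
    }
    return handled;
}

void
sim_crit_enter(int cpu)
{
    cpus[cpu].crit_count++;
}

/* Leaving the outermost critical section runs whatever was deferred. */
void
sim_crit_exit(int cpu)
{
    if (--cpus[cpu].crit_count == 0)
        ipiq_process(cpu);
}

/* Hypothetical allocator object that remembers the cpu it came from. */
struct obj {
    int owner_cpu;
    int freed;
};

/* Always runs on the owning cpu, so it needs no mutex. */
void
obj_free_local(void *arg)
{
    struct obj *o = arg;

    o->freed = 1;
    cpus[o->owner_cpu].free_count++;
}

/* Free from any cpu: local frees run directly, remote frees are
 * messaged back to the owner.  Returns 0, or -1 if the queue is full. */
int
obj_free(int cur_cpu, struct obj *o)
{
    if (o->owner_cpu == cur_cpu) {
        obj_free_local(o);
        return 0;
    }
    return ipi_send(o->owner_cpu, obj_free_local, o);
}
```

        If cpu 1 frees an object owned by cpu 0 while cpu 0 is inside a
        critical section, the free only takes effect when cpu 0 calls
        sim_crit_exit(), which is why code on the owning cpu can treat its
        own data as stable inside the section.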

                                                    -Matt

    :- They enable net.isr.enable by default, which provides inbound packet
    :...
    : consider at least some aspects of Jeffrey Hsu's work on DragonFly
    : to explore providing for multiple netisr's bound to CPUs, then directing
    : traffic based on protocol aware hashing that permits us to maintain
    : sufficient ordering to meet higher level protocol requirements while
    : avoiding the cost of maintaining full ordering. This isn't something we
    : have to do immediately, but exploiting parallelism requires both
    : effective synchronization and effective balancing of load.
    :
    : In the short term, I'm less interested in the avoidance of
    : synchronization of data adopted in the DragonFly approach, since I'd
    : like to see that approach validated on a larger chunk of the stack
    : (i.e., across the more incestuous pieces of the network stack), and also
    :...
    : benefits (such as a very strong assertion model). However, as aspects
    : of the DFBSD approach are validated (or not, as the case may be), we
    : should consider adopting things as they make sense. The approaches
    : offer quite a bit of promise, but are also very experimental and will
    : require a lot of validation, needless to say. I've done a little bit of
    : work to start applying the load distribution approach on FreeBSD, but
    : need to work more on the netisr infrastructure before I'll be able to
    : evaluate its effectiveness there.

    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

