Re: Network Stack Locking

From: Robert Watson (rwatson_at_freebsd.org)
Date: 05/21/04

  • Next message: Julian Elischer: "Re: atomic reference counting primatives."
    Date: Fri, 21 May 2004 13:23:51 -0400 (EDT)
    To: Matthew Dillon <dillon@apollo.backplane.com>
    
    

    On Thu, 20 May 2004, Matthew Dillon wrote:

    > It should be noted that the biggest advantages of the distributed
    > approach are (1) The ability to operate on individual PCBs without
    > having to do any token/mutex/other locking at all, (2) Cpu locality
    > of reference in regards to cache mastership of the PCBs and related data,
    > and (3) avoidance of data cache pollution across cpus (more cpus ==
    > better utilization of individual L1/L2 caches and far greater
    > scaleability). The biggest disadvantage is the mandatory thread switch
    > (but this is mitigated as load increases since each thread can work on
    > several PCBs without further switches, and because our thread scheduler
    > is extremely light weight under SMP conditions). Messaging passing
    > overhead is very low since most operations already require some sort of
    > roll-up structure to be passed (e.g. an mbuf in the case of the network).

    My primary concern with this approach (and the reason I'm taking somewhat
    of a "wait and see what happens" attitude) is the level of inter-component
    incestuousness (referred to elsewhere in this thread). At particular
    layers in the stack -- the PCBs are probably the best example -- I see the
    opportunity for this sort of per-CPU unsynchronized access offering a very
    clean and uncomplicated approach.

    However, I'm concerned that along many of the total end-to-end paths,
    there are a moderate number of pieces that will require traditional
    synchronization or extensive re-writing: the route table, host cache, a
    variety of "processing" packages such as netgraph, IPSEC, et al. None of
    that suggests that the per-cpu synchronization-free access in a thread
    shouldn't be applied, but I'd like to see it demonstrated to be a useful
    technique in a more broad sense. One of the key implied benefits of the
    approach is that it allows you to avoid significant rewriting costs for
    existing code, which is appealing, but less appealing if it doesn't fall
    out in the general case.

    The other concern I have is whether the message queues get deep or not:
    many of the benefits of message queues come when the queues allow
    coalescing of context switches to process multiple packets. If you're
    paying a context switch per packet passing through the stack each time you
    cross a boundary, there's a non-trivial operational cost to that. So what
    I'd like to see are the numbers that suggest, on a pretty functional
    sample stack, that you get at least an interesting level of queuing and
    therefore effective coalescing of synchronization. I've started looking
    at similar issues in the type-specific mbuf queues in the FreeBSD kernel
    -- additional context switches are expensive and best avoided even if you
    use explicit synchronization primitives such as mutexes.
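
    The coalescing argument above can be sketched in a few lines of C. This
    is an illustrative userspace model, not code from FreeBSD or DragonFly:
    the names msgq, msgq_put, msgq_drain and msgq_batch_factor are
    hypothetical, the queue holds plain ints as a stand-in for mbufs, and a
    "wakeup" is modeled as one call to the drain routine. The point is only
    that one wakeup which drains N queued messages spreads the cost of the
    switch over N deliveries.

```c
#include <assert.h>
#include <stddef.h>

#define MSGQ_CAP 64                 /* hypothetical fixed capacity */

struct msgq {
    int           slots[MSGQ_CAP];  /* stand-in for mbuf pointers */
    size_t        head;             /* next slot to dequeue */
    size_t        count;            /* messages currently queued */
    unsigned long wakeups;          /* consumer wakeups (modeled switches) */
    unsigned long delivered;        /* messages processed in total */
};

/* Enqueue one message; returns 0 on success, -1 if the queue is full. */
static int msgq_put(struct msgq *q, int msg)
{
    assert(q != NULL);
    if (q->count == MSGQ_CAP)
        return -1;
    q->slots[(q->head + q->count) % MSGQ_CAP] = msg;
    q->count++;
    return 0;
}

/*
 * One consumer wakeup: process everything that is queued right now.
 * handler may be NULL, in which case messages are simply discarded.
 * Returns the number of messages handled during this wakeup.
 */
static size_t msgq_drain(struct msgq *q, void (*handler)(int))
{
    size_t n = 0;

    assert(q != NULL);
    q->wakeups++;
    while (q->count > 0) {
        if (handler != NULL)
            handler(q->slots[q->head]);
        q->head = (q->head + 1) % MSGQ_CAP;
        q->count--;
        n++;
    }
    q->delivered += n;
    return n;
}

/* Average messages per wakeup; 1.0 means no coalescing at all. */
static double msgq_batch_factor(const struct msgq *q)
{
    assert(q != NULL);
    return q->wakeups ? (double)q->delivered / (double)q->wakeups : 0.0;
}
```

    If the batch factor stays close to 1.0 under load, each packet is paying
    for its own switch at that boundary; a value well above 1.0 is the
    "interesting level of queuing" asked for above.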

    > In anycase, if you are seriously considering any sort of distributed
    > methodology you should also consider formalizing a messaging passing
    > API for FreeBSD. Even if you don't like our LWKT messaging API, I
    > think you would love the DFly IPI messaging subsystem and it would be
    > very easy to port as a first step. We use it so much now in DFly
    > that I don't think I could live without it. e.g. for clock distribution,
    > interrupt distribution, thread/cpu isolation, wakeup(), MP-safe messaging
    > at higher levels (and hence packet routing), free()-return-to-
    > originating-cpu (mutexless slab allocator), SMP MMU synchronization
    > (the basic VM/pte-race issue with userland brought up by Alan Cox),
    > basic scheduler operations, signal(), and the list goes on and on.
    > In DFly, IPI messaging and message processing is required to be MP
    > safe (it always occurs outside the BGL, like a cpu-localized fast
    > interrupt), but a critical section still protects against reception
    > processing so code that uses it can be made very clean.

    As someone who's worked with Darwin and other Mach-derived operating
    systems, I see the clear appeal of message passing systems, as I think
    we've discussed in other forums. They also offer substantial benefits
    from a security perspective, as they provide cleaner separation
    between components, especially userspace and the kernel.
    However, based on past experience with such systems, I'm also very
    cautious about the notion. The increased level of separation between
    components can also make it harder to understand the interactions between
    components in a debugging sense: for example, if your stack trace in the
    TCP code only goes up to the queue receive primitive, the debugger can't
    simply tell you what code originated the mbuf.

    In the past, I've explored binding stack traces to messages in message
    passing systems when operating in debugging mode so that the debugger
    walks up to the message queue, and can then follow the stack trace from
    the message to understand more about the calling context. I've also used
    this on FreeBSD in userspace -- we have local modifications to allow the
    kernel to attach stack traces of the sending process to messages passed
    over UNIX domain sockets so that the receiving code can grab the stack
    trace as ancillary data.
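
    A rough userspace sketch of binding a stack trace to a message in
    debugging mode follows. It is not the kernel modification described
    above: it assumes the backtrace() and backtrace_symbols_fd() routines
    from <execinfo.h> (available in glibc, and in libexecinfo on the BSDs),
    and the names dbg_msg, msg_debug_enabled, dbg_msg_init and
    dbg_msg_print_origin are hypothetical.

```c
#include <assert.h>
#include <execinfo.h>
#include <stddef.h>

#define MSG_TRACE_MAX 16            /* hypothetical cap on saved frames */

struct dbg_msg {
    int   payload;                  /* stand-in for the real message body */
    void *trace[MSG_TRACE_MAX];     /* return addresses of the sender */
    int   trace_len;                /* 0 when tracing is disabled */
};

static int msg_debug_enabled = 1;   /* hypothetical debugging switch */

/* Fill in a message; in debugging mode, record the sender's stack too. */
static void dbg_msg_init(struct dbg_msg *m, int payload)
{
    assert(m != NULL);
    m->payload = payload;
    m->trace_len = 0;
    if (msg_debug_enabled)
        m->trace_len = (int)backtrace(m->trace, MSG_TRACE_MAX);
}

/* Receiver side: write the originating stack to stderr, if one was saved. */
static void dbg_msg_print_origin(const struct dbg_msg *m)
{
    assert(m != NULL);
    if (m->trace_len > 0)
        backtrace_symbols_fd(m->trace, m->trace_len, 2);
}
```

    The receiver's own stack still ends at the dequeue primitive, but the
    saved addresses travel with the message, so a debugger or a log line
    can continue from there into the sender's calling context.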

    The trick, though, is to make sure you're not just substituting message
    queue operations and context switches for mutexes, because those both have
    a moderate cost. Many of the benefits come in reducing explicit
    synchronization and then amortizing the context switch cost over multiple
    instances, which helps with the cache and many other things. So something
    I'd very much like to see out of the dfbsd prototype code is a set of
    measurements on queue depth at the hand-off points between layers, and
    statistics on the number of queue operations, synchronization points,
    etc., amortized over multiple deliveries.
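
    One way such queue-depth measurements might be collected is sketched
    below. This is a hypothetical instrumentation structure, not something
    taken from either code base: depth_stats, depth_bucket and depth_record
    are made-up names, and the power-of-two bucketing is just one convenient
    choice. The idea is to sample the queue depth each time a consumer
    wakes up at a hand-off point.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH_BUCKETS 8   /* depths 0, 1, 2-3, 4-7, 8-15, 16-31, 32-63, 64+ */

struct depth_stats {
    unsigned long samples;              /* number of wakeups sampled */
    unsigned long total_depth;          /* sum of observed depths */
    unsigned long max_depth;            /* deepest queue observed */
    unsigned long hist[DEPTH_BUCKETS];  /* power-of-two depth histogram */
};

/* Map a depth to its histogram bucket; the last bucket is open-ended. */
static size_t depth_bucket(unsigned long depth)
{
    size_t b = 0;

    while (depth > 0 && b < DEPTH_BUCKETS - 1) {
        depth >>= 1;
        b++;
    }
    return b;
}

/* Record the queue depth seen at one consumer wakeup. */
static void depth_record(struct depth_stats *s, unsigned long depth)
{
    assert(s != NULL);
    s->samples++;
    s->total_depth += depth;
    if (depth > s->max_depth)
        s->max_depth = depth;
    s->hist[depth_bucket(depth)]++;
}
```

    A histogram dominated by the 0 and 1 buckets would suggest that the
    hand-off is mostly trading a mutex for a context switch; weight in the
    higher buckets would suggest the switch cost is being amortized.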

    Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
    robert@fledge.watson.org      Senior Research Scientist, McAfee Research

    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

