Re: FreeBSD mail list etiquette

From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: 10/26/03

  • Next message: Robert Watson: "Synchronization philosophy (was: Re: FreeBSD mail list etiquette)"
    Date: Sat, 25 Oct 2003 21:01:08 -0700 (PDT)
    To: Robert Watson <rwatson@freebsd.org>
    
    

    :> It's a lot easier lockup path than the direction 5.x is going, and
    :> a whole lot more maintainable IMHO because most of the coding doesn't
    :> have to worry about mutexes or LORs or anything like that.
    :
    :You still have to be pretty careful, though, with relying on implicit
    :synchronization, because while it works well deep in a subsystem, it can
    :break down on subsystem boundaries. One of the challenges I've been
    :bumping into recently when working with Darwin has been the split between
    :their Giant kernel lock, and their network lock. To give a high level
    :summary of the architecture, basically they have two Funnels, which behave
    :similarly to the Giant lock in -STABLE/-CURRENT: when you block, the lock
    :is released, allowing other threads to enter the kernel, and regained when
    :the thread starts to execute again. They then have fine-grained locking
    :for the Mach-derived components, such as memory allocation, VM, et al.

        I recall a presentation at BSDCon that mentioned that... yours I think.

        The interfaces we are contemplating for the NETIF (at the bottom)
        and UIPC (at the top) are different. We probably won't need to use
        any mutexes to queue incoming packets to the protocol thread, we will
        almost certainly use an async IPI message to queue a message holding the
        packet if the protocol thread is on a different cpu. On the same cpu
        it's just a critical section to interlock the queueing operation against
        the protocol thread. Protocol packet output to NETIF would use the
        same methodology... asynch IPI message if the NETIF is on another cpu,
        critical section if it is on the current cpu.

        The protocol itself will change from a softint to a normal thread, or
        perhaps a thread at softint priority. The softint is already a thread
        but we would separate each protocol into its own thread and have an
        ability to create several threads for a single protocol (like TCP) when
        necessary to take advantage of multiple cpus.

        On the UIPC side we have a choice of using a mutex to lock the socket
        buffer, or passing a message to the protocol thread responsible for
        the socket buffer (aka PCB). There are tradeoffs for both situations
        since if this is related to a write() it winds up being a synchronous
        message. Another option is to COW the memory but that might be too
        complex. Smaller writes could simply copyin() the data as an option,
        or we could treat the socket buffer as a FIFO which would allow the
        system call UIPC interface to append to it without holding any locks
        (other than a memory barrier after the copy and before updating the
        index), then simply send a kick-off message to the protocol thread
        telling it that more data is present.
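
        A minimal sketch of the FIFO option described above, assuming one
        writer (the system call side) and one reader (the protocol thread).
        The names sockbuf_fifo, sbf_append, sbf_drain and SB_SIZE are
        hypothetical, and C11 release/acquire atomics stand in for the
        kernel's memory barrier. This is not DragonFly code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SB_SIZE 4096u   /* hypothetical capacity; must be a power of two */

/* Hypothetical single-writer/single-reader socket buffer. */
struct sockbuf_fifo {
    unsigned char data[SB_SIZE];
    atomic_uint   windex;   /* advanced only by the writer (syscall side)   */
    atomic_uint   rindex;   /* advanced only by the reader (protocol thread) */
};

/* Writer: append up to len bytes without a lock; returns bytes queued. */
static size_t sbf_append(struct sockbuf_fifo *sb, const void *src, size_t len)
{
    const unsigned char *p = src;
    unsigned w = atomic_load_explicit(&sb->windex, memory_order_relaxed);
    unsigned r = atomic_load_explicit(&sb->rindex, memory_order_acquire);
    size_t space = SB_SIZE - (size_t)(w - r);

    assert(sb != NULL && src != NULL);
    if (len > space)
        len = space;
    for (size_t i = 0; i < len; i++)
        sb->data[(w + i) & (SB_SIZE - 1)] = p[i];
    /* Release store: the copy above becomes visible before the new index. */
    atomic_store_explicit(&sb->windex, w + (unsigned)len, memory_order_release);
    return len;
}

/* Reader: copy out up to len bytes; returns bytes consumed. */
static size_t sbf_drain(struct sockbuf_fifo *sb, void *dst, size_t len)
{
    unsigned char *p = dst;
    unsigned r = atomic_load_explicit(&sb->rindex, memory_order_relaxed);
    unsigned w = atomic_load_explicit(&sb->windex, memory_order_acquire);
    size_t avail = (size_t)(w - r);

    assert(sb != NULL && dst != NULL);
    if (len > avail)
        len = avail;
    for (size_t i = 0; i < len; i++)
        p[i] = sb->data[(r + i) & (SB_SIZE - 1)];
    /* Release store: the slots are free for reuse only after the copy-out. */
    atomic_store_explicit(&sb->rindex, r + (unsigned)len, memory_order_release);
    return len;
}
```

        After sbf_append() returns, the writer would send the kick-off
        message to the protocol thread; that step is left out here.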

    :Deep in a particular subsystem -- say, the network stack, all works fine.
    :The problem is at the boundaries, where structures are shared between
    :multiple compartments. I.e., process credentials are referenced by both
    :"halves" of the Darwin BSD kernel code, and are insufficiently protected
    :in the current implementation (they have a write lock, but no read lock,
    :so it looks like it should be possible to get stale references with
    :pointers accessed in a read form under two different locks). Similarly,
    :there's the potential for serious problems at the surprisingly frequently
    :occurring boundaries between the network subsystem and remainder of the
    :kernel: file descriptor related code, fifos, BPF, et al. By making use of
    :two large subsystem locks, they do simplify locking inside the subsystem,
    :but it's based on a web of implicit assumptions and boundary
    :synchronization that carries most of the risks of explicit locking.

        Yes. I'm not worried about BPF, and ucred is easy since it is
        already 95% of the way there, though messing with ucred's ref count
        will require a mutex or an atomic bus-locked instruction even in
        DragonFly! The route table is our big issue. TCP caches routes so we
        can still BGL the route table and achieve 85% of the scalable
        performance so I am not going to worry about the route table initially.

        An example with ucred would be to passively queue it to a particular cpu
        for action. Let's say instead of using an atomic bus-locked instruction
        to manipulate ucred's ref count, we instead send a passive IPI to the
        cpu 'owning' the ucred, and that ucred is otherwise read-only. A
        passive IPI, which I haven't implemented yet, is simply queueing an
        IPI message but not actually generating an interrupt on the target cpu
        unless the CPU->CPU software IPI message FIFO is full, so it doesn't
        actually waste any cpu cycles and multiple operations can be executed
        in-batch by the target. Passive IPIs can be used for things
        that do not require instantaneous action and both bumping and releasing
        ref counts can take advantage of it. I'm not saying that is how
        we will deal with ucred, but it is a definite option.
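
        A single-threaded simulation of the passive IPI idea above, as a
        sketch only. The passive IPI was not implemented when this was
        written, so every name here (ipiq, ipi_send_passive, ipiq_process,
        ucred_sim, IPIQ_SIZE) is hypothetical, and the "interrupt" is just
        a counter plus a direct call into the target's batch processing.

```c
#include <assert.h>
#include <stddef.h>

#define IPIQ_SIZE 4u   /* hypothetical FIFO depth, kept tiny for illustration */

typedef void (*ipi_func_t)(void *arg);

struct ipi_msg {
    ipi_func_t func;
    void      *arg;
};

/* One CPU->CPU software message FIFO (simulated, no real concurrency). */
struct ipiq {
    struct ipi_msg msgs[IPIQ_SIZE];
    unsigned windex, rindex;
    unsigned interrupts;   /* times the sender had to interrupt the target */
};

/* Target side: run everything queued so far as one batch. */
static unsigned ipiq_process(struct ipiq *q)
{
    unsigned n = 0;

    while (q->rindex != q->windex) {
        struct ipi_msg m = q->msgs[q->rindex % IPIQ_SIZE];
        q->rindex++;
        m.func(m.arg);
        n++;
    }
    return n;
}

/* Sender side: queue the message; interrupt the target only if the FIFO is full. */
static void ipi_send_passive(struct ipiq *q, ipi_func_t func, void *arg)
{
    assert(func != NULL);
    if (q->windex - q->rindex == IPIQ_SIZE) {
        q->interrupts++;    /* stand-in for a real hardware IPI */
        ipiq_process(q);    /* the target drains the whole batch */
    }
    q->msgs[q->windex % IPIQ_SIZE] = (struct ipi_msg){ func, arg };
    q->windex++;
}

/* A credential whose ref count is modified only by its owning cpu. */
struct ucred_sim { int refs; };

static void cred_hold(void *arg) { ((struct ucred_sim *)arg)->refs++; }
static void cred_drop(void *arg) { ((struct ucred_sim *)arg)->refs--; }
```

        In this sketch the target would also call ipiq_process() whenever
        it next passes through its normal message-processing path, which
        is an assumption about how the queued work eventually gets run.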

    :It's also worth noting that there have been some serious bugs associated
    :with a lack of explicit synchronization in the non-concurrent kernel model
    :used in RELENG_4 (and a host of other early UNIX systems relying on a
    :single kernel lock). These have to do with unexpected blocking deep in a
    :function call stack, where it's not anticipated by a developer writing
    :source code higher in the stack, resulting in race conditions. In the

        I've encountered this with softupdates, so I know what you mean.
        softupdates (at least in 4.x) is extremely sensitive to blocking in
        places where it doesn't expect blocking to happen. My free() code was
        occasionally (and accidentally) blocking in an interrupt thread waiting
        on kernel_map (I've already removed kmem_map from DragonFly), and this
        was enough to cause softupdates to panic in its IO completion rundown
        once in a blue moon due to assumptions on its lock 'lk'.

        Synchronization is a bigger problem in 5.x than it is in DragonFly because
        in DragonFly most of the work is shoved over to the cpu that 'owns' the
        data structure via an async IPI. e.g. when you want to schedule thread X
        on cpu 1 and thread X is owned by cpu 2, cpu 1 will send an asynch
        IPI to cpu 2 and cpu 2 will actually do the scheduling. If the cpuid
        changes during the message transit cpu 2 will simply chase the owning cpu,
        forwarding it along. It doesn't matter if the cpuid is out of synch,
        in fact! You don't even need a memory barrier. Same goes for the slab
        allocator... DragonFly does not mess with the slab allocated by another
        cpu, it forwards the free() request to the other cpu instead.
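
        A toy model of the "chase the owning cpu" behaviour described
        above, assuming a hypothetical thread_sim structure and a
        schedule_msg() handler. Message transit is collapsed into a loop,
        so this only shows the forwarding rule, not real IPI delivery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread record; only the owning cpu changes these fields. */
struct thread_sim {
    int owner_cpu;   /* cpu that currently owns the thread */
    int runnable;    /* set once the owner has scheduled it */
};

/*
 * Handler for a "schedule this thread" message arriving on cpu `here`.
 * `here` is wherever the sender believed the owner to be, which may be
 * stale. A cpu that does not own the thread forwards the message instead
 * of touching the thread. Returns the cpu that did the scheduling and
 * stores the number of forwards in *hops.
 */
static int schedule_msg(struct thread_sim *td, int here, int *hops)
{
    assert(td != NULL && hops != NULL);
    *hops = 0;
    while (here != td->owner_cpu) {
        here = td->owner_cpu;   /* forward the message along */
        (*hops)++;
    }
    td->runnable = 1;           /* only the owner touches scheduling state */
    return here;
}
```

        A sender with a stale view (say it thinks cpu 1 owns a thread that
        has since moved to cpu 2) still gets the right result after one
        extra hop, which is why the stale cpuid is harmless.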

        For a protocol, a protocol thread will own a PCB, so the PCB will be
        'owned' by the cpu the protocol thread is on. Any manipulation of the
        PCB must occur on that cpu or otherwise be very carefully managed
        (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
        Our intention is to encapsulate most operations as messages to the
        protocol thread owning the PCB.

    :past, there have been a number of exploitable security vulnerabilities due
    :to races opened up in low memory conditions, during paging, etc. One
    :solution I was exploring was using the compiler to help track the
    :potential for functions to block, similar to the const qualifier, combined
    :with blocking/non-blocking assertions evaluated at compile-time. However,
    :some of our current APIs (M_NOWAIT, M_WAITOK, et al) make that approach
    :somewhat difficult to apply, and would have to be revised to use a
    :compiler solution. These potential weaknesses very much exist in an
    :explicit model, but with explicit locking, we have a clearer notion of how
    :to express assertions.

        DragonFly is using its LWKT messaging API to abstract blocking versus
        non-blocking. In particular, if a client sends a message using an
        asynch interface it isn't supposed to block, but can return EASYNC if it
        wound up queueing the message due to not being able to execute it
        synchronously without blocking. If a client sends a message using a
        synchronous messaging interface then the client is telling the
        messaging subsystem that it is ok to block.

        This combined with the fact that we are using critical sections and
        per-cpu globaldata caches that do not require mutexes to access allows
        code to easily determine whether something might or might not block,
        and the message structure is a convenient placemark to queue and
        return EASYNC deep in the kernel if something would otherwise block
        when it isn't supposed to.
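
        A sketch of the async/sync split described above. This is not the
        LWKT API: msg, port, msg_send_async, msg_send_sync, port_run and
        the EASYNC_SIM value are all hypothetical, and the would_block
        field is a simulation knob standing in for "this operation cannot
        complete right now".

```c
#include <assert.h>
#include <stddef.h>

enum { EASYNC_SIM = 1000 };   /* hypothetical stand-in for EASYNC */

struct msg {
    struct msg *next;
    int would_block;   /* simulation knob: can the op finish immediately? */
    int done;
    int error;
};

struct port {
    struct msg *head, **tailp;   /* messages waiting for the owning thread */
};

static void port_init(struct port *p)
{
    p->head = NULL;
    p->tailp = &p->head;
}

/* Async send: never blocks. Runs inline if possible, else queues. */
static int msg_send_async(struct port *p, struct msg *m)
{
    assert(p != NULL && m != NULL);
    if (!m->would_block) {
        m->done = 1;
        m->error = 0;
        return 0;
    }
    m->next = NULL;          /* the message itself is the queue placemark */
    *p->tailp = m;
    p->tailp = &m->next;
    return EASYNC_SIM;
}

/* The port's owning thread: complete queued messages (allowed to block). */
static unsigned port_run(struct port *p)
{
    unsigned n = 0;
    struct msg *m;

    while ((m = p->head) != NULL) {
        p->head = m->next;
        m->done = 1;
        m->error = 0;
        n++;
    }
    p->tailp = &p->head;
    return n;
}

/* Sync send: the caller has said blocking is acceptable. */
static int msg_send_sync(struct port *p, struct msg *m)
{
    int r = msg_send_async(p, m);

    if (r == EASYNC_SIM) {
        port_run(p);         /* stands in for sleeping until the reply */
        r = m->error;
    }
    return r;
}
```

        The caller's choice of entry point is what tells the messaging
        layer whether blocking is allowed, so code deep in the call stack
        does not have to guess.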

        We also have the asynch IPI mechanism and a few other mechanisms at
        our disposal and these cover a surprisingly large number of situations
        in the system. 90% of the 'not sure if we might block' problem
        is related to scheduling or memory allocation and neither of those
        subsystems needs to use extraneous mutexes, so managing the blocking
        conditions is actually quite easy.

    :In -CURRENT, we make use of thread-based serialization in a number of
    :places to avoid explicit synchronization costs (such as in GEOM for
    :processing work queues), and we should make more use of this practice.
    :I'm particularly interested in the use of interface interrupt threads
    :performing direct dispatch as a means to maintain interface ordering of
    :packets coming in network interfaces while allowing parallelism in network
    :processing (you'll find this in use in Sam's netperf branch currently).
    :
    :Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
    :robert@fledge.watson.org Network Associates Laboratories

        I definitely think that -current should explore a greater role for
        threading subsystems. Remember that many operations can be done
        asynchronously and thus do not actually require synchronous context
        switches or blocking. A GEOM strategy routine is a good example, since
        it must perform I/O and I/O *ALWAYS* blocks or takes an interrupt
        at some point. However, you need to be careful because not all
        operations truly need to be run in a threaded subsystem's thread
        context. This is why DragonFly's LWKT messaging subsystem uses the
        Amiga's BeginIo abstraction for dispatching a message, which allows
        the target port to execute messages synchronously in the context of
        the caller if it happens to be possible to do so without blocking.

        The advantage of this is that we can start out by always queueing the
        message (thereby guaranteeing that queue mode operation will always
        be acceptable), and then later on we can optimize particular messages
        (such as read()'s that are able to lock and access the VM object's
        page cache without blocking, in order to avoid switching to a
        filesystem thread unnecessarily).
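
        A sketch of a BeginIo-style dispatch hook, assuming a per-port
        function pointer. The names (port.beginio, beginio_queue,
        beginio_fastpath, msg_dispatch) and the two commands are
        hypothetical and are not taken from the LWKT sources.

```c
#include <assert.h>
#include <stddef.h>

enum { MSG_READ_CACHED, MSG_READ_DISK };   /* hypothetical commands */
enum { RAN_INLINE = 0, QUEUED = 1 };       /* hypothetical dispatch results */

struct port;

struct msg {
    struct msg *next;
    int cmd;
    int done;
};

struct port {
    /* The target port decides how each message starts executing. */
    int (*beginio)(struct port *, struct msg *);
    struct msg *head, **tailp;
};

/* Conservative hook: always queue to the port's thread. Always correct. */
static int beginio_queue(struct port *p, struct msg *m)
{
    m->next = NULL;
    *p->tailp = m;
    p->tailp = &m->next;
    return QUEUED;
}

/* Optimized hook: run non-blocking work in the caller's context. */
static int beginio_fastpath(struct port *p, struct msg *m)
{
    if (m->cmd == MSG_READ_CACHED) {   /* data already cached, cannot block */
        m->done = 1;
        return RAN_INLINE;
    }
    return beginio_queue(p, m);        /* anything else goes to the thread */
}

static void port_init(struct port *p,
                      int (*fn)(struct port *, struct msg *))
{
    assert(fn != NULL);
    p->beginio = fn;
    p->head = NULL;
    p->tailp = &p->head;
}

/* Callers never know (or care) which hook the port installed. */
static int msg_dispatch(struct port *p, struct msg *m)
{
    return p->beginio(p, m);
}
```

        Because callers only ever use msg_dispatch(), a port can start
        with beginio_queue and switch to a fast-path hook later without
        any caller changing.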

        I'm sure we will hit issues but so far it has been smooth sailing.

                                            -Matt
                                            Matthew Dillon
                                            <dillon@backplane.com>
    _______________________________________________
    freebsd-hackers@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
    To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"

