Network Stack Locking

From: Robert Watson (rwatson_at_FreeBSD.org)
Date: 05/20/04

    Date: Thu, 20 May 2004 16:30:26 -0400 (EDT)
    To: arch@FreeBSD.org
    
    

    1.5 line summary:

      This is an e-mail about the on-going network stack locking and contains
      largely technical stuff.

    Executive summary:

      The high level view, for those less willing to wade through a greater
      level of detail, is that we have a substantial work in progress with a
      lot of our bases covered, and that we're looking for broader exposure
      for the work. We've been merging smaller parts of the work (supporting
      infrastructure, fine-grained locking for specific leaf dependencies),
      and are starting to think about larger scale merging over the next
      month or two. There are some known serious issues in the current work,
      but we've also identified some areas that need attention outside of the
      stack in order to make serious progress on merging. There are also some
      important tasks that require owners moving forward, and a solicitation
      for those areas. I don't attempt to capture everything in this e-mail;
      in particular, things like locking strategies are left out. You will
      find patch URLs and Perforce references.

    Body:

    As many of you are aware, I've become the latest inheritor of the omnibus
    "Network Stack Locking" task of SMPng. This work has a pretty long
    history that I won't attempt to go into here, other than to observe that:

    - This is a product of the adoption of the SMPng approach a few years ago
      by the FreeBSD Project for the FreeBSD 5.x line. This approach
      attempts to address a lack of kernel parallelism and preemption, as well
      as generally formalizing synchronization, adopting architectural
      properties such as interrupt threads and a more general use of threads
      in the kernel, etc.

    - The vast majority of work that will be discussed in this e-mail is the
      product of significant contributions of others, including: Jonathan
      Lemon, Jennifer Yang, Jeffrey Hsu, and Sam Leffler, and a large number
      of other contributors (many of whom are named in recent status
      reports, but some of whom I've inevitably accidentally omitted and would
      be happy to be reminded of via private e-mail!).

    The goal of this e-mail is to provide a bit of high level information
    about what is going on to increase awareness, solicit involvement in a
    variety of areas, and throw around words like "merge schedule". Warning:
    this is a work in progress, and you will find rough parts. This is being
    worked on actively, but by bringing this up during the process, we can
    improve the work. If you see things that scare you, that's a reasonable
    response.

    Now into the details:

    Those following the last few status reports will know that recent work has
    focused in the following areas:

    - Introducing and refining data based locking for the top levels of the
      network stack (sockets, socket buffers, et al).

    - Refining and testing locking for lower pieces of the stack that already
      have locking.

    - Locking for UNIX domain sockets, FIFOs, etc.

    - Iterating through pseudo-interfaces and network interfaces to identify
      and correct locking problems.

    - Allowing Giant to be conditionally acquired across the entire stack
      using a Giant Toggle Switch.

    - Addressing interactions with tightly coupled support infrastructure for
      the stack, including the MAC Framework, kqueue, sigio, select(), general
      signaling primitives, et al.

    - Investigating, and in many cases locking, less popular/less widely used
      stack components that were previously unaddressed, such as IPv6,
      netatalk, netipx, et al.

    - Some local changes used to monitor and assert locks at a finer
      granularity than in the main tree. Specifically, sampling of callouts
      and timeouts to measure what we're grabbing Giant for, and in certain
      branches, the addition of a great many assertions.
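
    The Giant Toggle Switch item above can be illustrated with a rough
    userspace sketch. It is a model under stated assumptions, not the code in
    the branch: a pthread mutex stands in for Giant, and the names mpsafenet,
    net_lock_giant(), net_unlock_giant(), and stack_entry() are hypothetical.
    The point is only that every entry into the stack goes through one
    conditional acquire, so the same code path runs with or without Giant.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the kernel's Giant lock (userspace model, not kernel code). */
static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical toggle: true means the stack is trusted to run without Giant. */
static bool mpsafenet = false;

/* Counts how often Giant was really taken, purely for illustration. */
static int giant_acquisitions = 0;

/*
 * Acquire Giant only when the stack is not marked MP-safe. The return value
 * records whether Giant was taken, so the caller releases symmetrically.
 */
static bool net_lock_giant(void)
{
    if (mpsafenet)
        return false;
    pthread_mutex_lock(&giant);
    giant_acquisitions++;
    return true;
}

/* Release Giant only if the matching net_lock_giant() call took it. */
static void net_unlock_giant(bool held)
{
    if (held) {
        int rc = pthread_mutex_unlock(&giant);
        assert(rc == 0);
        (void)rc;
    }
}

/* Example stack entry point: one code path serves both modes. */
static int stack_entry(int value)
{
    bool held = net_lock_giant();
    int result = value + 1;     /* placeholder for protocol processing */
    net_unlock_giant(held);
    return result;
}
```

    Returning whether Giant was taken keeps the release symmetric even if the
    toggle were flipped while a caller is inside the stack.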

    This work is occurring in a number of Perforce branches. The primary
    branch that is actively worked on is "rwatson_netperf", which may be found
    at the following path:

      //depot/users/rwatson/netperf/...

    Additional work is taking place to explore socket locking issues in:

      //depot/users/rwatson/net2/...

    A number of other developers have branches off of these branches to
    explore locking for particular subsystems. There are also some larger
    unintegrated patch sets for data-based NFS locking, fixing the user space
    build, etc. You can find a non-Perforce version at:

      http://www.watson.org/~robert/freebsd/netperf/

    This includes a basic change log and incrementally generated patches, work
    sets, etc. Perforce is the preferred way to get to the work as it
    provides easier access to my working notes, the ability to maintain local
    changes, get the most recent version, etc. I try to drop patches fairly
    regularly -- several times a week against HEAD, but due to travel to
    BSDCan, I'm about two weeks behind. I hope to make substantial headway
    this weekend in updating the patch set and integrating a number of recent
    socket locking changes from various work branches.

    This work is currently a work in progress, and has a number of known
    issues, including some lock order reversal problems, known deficiencies
    in socket locking coverage of socket variables, etc. However, it is now
    being reviewed and worked on by an increasingly broad population of
    FreeBSD developers, so I wanted to move to a more general patch posting
    process and attempt to identify additional "hired hands" for areas that
    require additional work. Here are current known tasks and current owners:

    Task                            Developer
    ----                            ---------
    Sockets                         Robert Watson
    Synthetic network interfaces    Robert Watson
    Netinet6                        George Neville-Neil
    Netatalk                        Robert Watson
    Netipx                          Robert Watson
    Interface Locking               Max Laier, Luigi Rizzo,
                                    Maurycy Pawlowski-Wieronski,
                                    Brooks Davis
    Routing Cleanup                 Luigi Rizzo
    KQueue (subsystem lock)         Brian Feldman
    KQueue (data locking)           John-Mark Gurney
    NFS Server (subsystem lock)     Robert Watson
    NFS Server (data locking)       Rick Macklem
    SPPP                            Roman Kurakin
    Userspace build                 Roman Kurakin
    VFS/fifofs interactions         Don Lewis
    Performance measurement         Pawel Jakub Dawidek

    And of course, I can't neglect to mention the on-going work of Kris
    Kennaway to test out these changes on high-load systems :-).

    Some noted absences in the above, and areas where I'd like to see
    additional people helping out are:

    - Reviewing Netgraph modules for correct interactions with locking in the
      remainder of the system. I've started pushing some locking into
      ng_ksocket.c and ng_socket.c, and some of the basic infrastructure that
      needed it, but each module will need to be reviewed for correct locking.

    - ATM -- Harti? :-)

    - Network device drivers -- some have locking, some have correct locking,
      some have potential interactions with other pieces of the system (such
      as the USB stack). Note that for a driver to work correctly with a
      Giant-free system, it must be safe to invoke ifp->if_start() without
      holding Giant, and if_start() must be aware that it cannot acquire
      Giant without generating a lock order issue. It's OK for if_input() to
      be called with Giant, although generally undesirable. Some drivers also
      have locking that is commented out by default due to use of recursive
      locks, but I'm not sure this is necessarily a sufficient problem not to
      just turn on the locking.

    - Complete coverage of synthetic/pseudo-interfaces. In particular,
      careful addressing of if_gif and other "cross-layer" and protocol aware
      pieces.

    - mbuma -- Bosko's work looks good to me; we need to make sure all the
      pieces work with each other. Getting down to one large memory allocator
      would be great. I'm interested in exploring uniprocessor optimizations
      here -- I notice that a lot of the locks getting acquired in profiling
      are for memory allocation. Critical sections, per-cpu
      variables/caching, and pinning all seem like reasonable approaches to
      reduce synchronization costs here.
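
    The per-cpu caching idea in the mbuma item can be sketched in userspace
    as follows. This is an illustration under stated assumptions, not mbuma
    or UMA code: thread-local storage stands in for per-cpu data, and the
    names buf, buf_alloc(), buf_free(), CACHE_SLOTS, and pool_lock_count are
    hypothetical. In the kernel, the fast path would additionally need a
    critical section or pinning so the thread cannot migrate or be preempted
    while it touches the cache.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 8           /* hypothetical per-cpu cache depth */

struct buf {
    struct buf *next;
    char        data[256];
};

/* Global free list protected by a mutex: the expensive, contended path. */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct buf *pool_free = NULL;
static unsigned long pool_lock_count = 0;   /* acquisitions, for illustration */

/* Thread-local cache standing in for a per-cpu cache; only its owner uses it. */
static _Thread_local struct buf *cache[CACHE_SLOTS];
static _Thread_local int cache_count = 0;

static struct buf *buf_alloc(void)
{
    struct buf *b;

    if (cache_count > 0)                    /* fast path: no lock taken */
        return cache[--cache_count];

    pthread_mutex_lock(&pool_lock);         /* slow path: shared pool */
    pool_lock_count++;
    b = pool_free;
    if (b != NULL)
        pool_free = b->next;
    pthread_mutex_unlock(&pool_lock);

    if (b == NULL)
        b = malloc(sizeof(*b));             /* may return NULL on failure */
    return b;
}

static void buf_free(struct buf *b)
{
    assert(b != NULL);
    if (cache_count < CACHE_SLOTS) {        /* fast path: keep it local */
        cache[cache_count++] = b;
        return;
    }
    pthread_mutex_lock(&pool_lock);         /* cache full: return to pool */
    pool_lock_count++;
    b->next = pool_free;
    pool_free = b;
    pthread_mutex_unlock(&pool_lock);
}
```

    A free followed by an allocation on the same thread is served from the
    local cache, so the pool mutex is touched only on a miss or an overflow.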

    Note that there are some serious issues with the current locking changes:

    - Socket locking is deficient in a number of ways -- primarily that there
      are several important socket fields that are currently insufficiently or
      inconsistently synchronized. I'm in the throes of correcting this, but
      that requires a line-by-line review of all use of sockets, which will
      take me at least another week or two to complete. I'm also addressing
      some races between listen sockets and the sockets hung off of them
      during the new connection setup and accept process. Currently there is
      no defined lock order between multiple sockets, and if possible I'd like
      to keep it that way.

    - Based on the BSD/OS strategy, there are two mutexes on a socket: each
      socket buffer has a mutex (send, receive), and then the basic socket
      fields are locked using SOCK_LOCK(), which actually uses the receive
      socket buffer mutex. This reduces the locking overhead while helping to
      address ordering issues in the upward and downward paths. However,
      there are also some issues of locking correctness and redundancy, and
      I'm looking into these as part of an overall review of the strategy.
      It's worth noting that the BSD/OS snapshot we have has substantially
      incomplete and non-functional socket locking, so unlike some other
      pieces of the network stack, it was not possible to use the strategy
      whole-cloth. In the long term, the socket locking model may require
      substantial revision.

    - Per some recent discussions on -CURRENT, I've been exploring mitigating
      locking costs through coalescing activities on multiple packets. I.e.,
      effectively passing in queues of packet chains across API boundaries, as
      well as creating local work queues. It's a bit early to commit to this
      approach because the performance numbers have not confirmed the benefit,
      but it's important to keep that possible approach in mind across all
      other locking work, as it trades off work queue latency with
      synchronization cost. My earlier experimentation occurred at the end of
      2003, so I hope to revisit this now that more of the locking is in place
      to offer us advantages in preemption and parallelism.

    - The current changes enable net.isr.enable by default, which provides
      inbound packet parallelism through running to completion in the
      ithread. This has other down sides, and while we should provide the
      option, I think we should continue to support forcing use of the
      netisr. One of the problems with the netisr approach is how to
      accomplish inbound processing parallelism without sacrificing the
      currently strong ordering properties, the loss of which could cause bad
      TCP behavior, etc. We should seriously consider at least some aspects
      of Jeffrey Hsu's work on DragonFly to explore providing for multiple
      netisr's bound to CPUs, then directing traffic based on protocol aware
      hashing that permits us to maintain sufficient ordering to meet higher
      level protocol requirements while avoiding the cost of maintaining full
      ordering. This isn't something we have to do immediately, but
      exploiting parallelism requires both effective synchronization and
      effective balancing of load.

      In the short term, I'm less interested in the avoidance of
      synchronization of data adopted in the DragonFly approach, since I'd
      like to see that approach validated on a larger chunk of the stack
      (i.e., across the more incestuous pieces of the network stack), and also
      to see performance numbers that confirm the claims. The approach we're
      currently taking is tried and true across a broad array of systems
      (almost every commercial UNIX vendor, for example), and offers many
      benefits (such as a very strong assertion model). However, as aspects
      of the DFBSD approach are validated (or not, as the case may be), we
      should consider adopting things as they make sense. The approaches
      offer quite a bit of promise, but are also very experimental and will
      require a lot of validation, needless to say. I've done a little bit of
      work to start applying the load distribution approach on FreeBSD, but
      need to work more on the netisr infrastructure before I'll be able to
      evaluate its effectiveness there.

    - There are still some serious issues in the timely processing and
      scheduling of device driver interrupts, and these affect performance in
      a number of ways. They also change the degree of effective coalescing
      of interrupts, making it harder to evaluate strategies to lower costs.
      These issues aren't limited to the network stack work, but I wanted to
      make sure it was on the list of concerns. Improving our scheduling and
      handling of interrupts will be critical to realizing the performance
      benefits SMPng has offered.

    - There are issues relating to upcalls from the socket layer: while many
      consumers of sockets simply sleep for wakeups on socket pointers,
      so_upcall() permits the network stack to "upcall" into other components
      of the system. I believe this was introduced initially for the NFS
      server to allow initial processing of RPCs to occur in the netisr rather
      than waiting on a context switch to the NFS server threads. However,
      it's now also used for accept sockets, and I'm aware of outstanding
      changes that modify the NFS client to use it as well. We need to
      establish what locks will be held over the upcall, if any, and what
      expectations are in place for implementers of upcall functions. At the
      very least, they have to be MPSAFE, but there are also potential lock
      order issues.

    - Locking for KQueue is critical to success. Without locking down the
      event infrastructure, we can't remove Giant from the many interesting
      pieces of the network stack. KQueue is an example of a high level of
      incestuousness between levels, and will require careful handling.
      Brian's approach adopts a "single subsystem" lock for KQueue and as such
      offers a low hanging fruit approach, but comes at a number of costs, not
      least of which are lost parallelism and lost functionality. John-Mark's
      approach appears to offer more granular locking with higher
      parallelism, but at the cost of complexity. I've not yet had the
      opportunity to review either in any detail, but I know Brian has
      integrated a work branch in Perforce that combines his changes with the
      locking in rwatson_netperf, and has performed testing. There's obviously
      more work to go on here, and it is required to get to "Giant-free
      operation".
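
    The protocol aware hashing mentioned in the netisr item above can be
    sketched as follows. This is a minimal illustration and not DragonFly or
    FreeBSD code: the flow_key layout, NETISR_WORKERS, flow_hash(), and
    netisr_select() are hypothetical, and FNV-1a is simply a convenient hash
    chosen for the example. Packets of one flow always hash to the same
    worker, which preserves per-flow ordering without ordering unrelated
    flows against each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NETISR_WORKERS 4u       /* hypothetical: one worker per CPU */

/* Fields identifying one direction of a TCP/UDP flow. */
struct flow_key {
    uint32_t src_addr;
    uint32_t dst_addr;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

/* Mix the low nbytes bytes of v into h using 32-bit FNV-1a. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, unsigned int nbytes)
{
    for (unsigned int i = 0; i < nbytes; i++) {
        h ^= (v >> (8u * i)) & 0xffu;
        h *= 16777619u;         /* FNV prime (32-bit) */
    }
    return h;
}

/* Hash field by field so struct padding never enters the hash. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;   /* FNV offset basis (32-bit) */

    h = fnv1a_mix(h, k->src_addr, 4);
    h = fnv1a_mix(h, k->dst_addr, 4);
    h = fnv1a_mix(h, k->src_port, 2);
    h = fnv1a_mix(h, k->dst_port, 2);
    h = fnv1a_mix(h, k->proto, 1);
    return h;
}

/* Pick the worker that will process every packet of this flow. */
static unsigned int netisr_select(const struct flow_key *k)
{
    assert(k != NULL);
    return flow_hash(k) % NETISR_WORKERS;
}
```

    Two different flows may still land on the same worker; the sketch only
    guarantees that a given flow never moves between workers.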

    For more complete changes and history, I would refer you to the last few
    FreeBSD Status Reports on network stack locking. I would also encourage
    you to contact me if you would like to claim some section of the stack for
    work so I can coordinate activities. These patch sets have been pounded
    heavily in a wide variety of environments, but there are several known
    issues so I would recommend using them cautiously.

    In terms of merging: I've been gradually merging a lot of the
    infrastructure pieces as I went along. The next big chunks to consider
    merging are:

    - Socket locking. This needs to wait until I'm happier with the
      strategy.

    - UNIX domain socket locking. This is probably an early candidate, but
      because of potential interactions with socket locking changes, I've been
      deferring the merge.

    - NFS server locking. I had planned to merge the current subsystem lock
      quickly, but then Rick turned up with fine-grained data based locking of
      the NFS server, and NFSv4 server code when I asked him for review of the
      subsystem lock, so I've been holding off.

    - Additional general infrastructure, such as more pseudo-interface
      locking, fifofs stuff, etc. I'll continue on the gradual incremental
      merge path as I have been for the past few months.
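
    For readers weighing the socket locking item in the list above, the
    two-mutex layout described earlier (one mutex per socket buffer, with the
    socket lock sharing the receive buffer's mutex) can be modeled roughly in
    userspace. This is a sketch under stated assumptions and not the branch's
    code: the struct names, socket_model_init(), sock_mtx(), and
    sock_deliver() are hypothetical, and pthread mutexes stand in for kernel
    mutexes.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace model of a socket buffer: one mutex per buffer. */
struct sockbuf_model {
    pthread_mutex_t sb_mtx;
    unsigned int    sb_cc;      /* bytes currently queued */
};

/* Userspace model of a socket with a receive and a send buffer. */
struct socket_model {
    struct sockbuf_model so_rcv;
    struct sockbuf_model so_snd;
    int                  so_state;  /* a basic socket field */
};

static void socket_model_init(struct socket_model *so)
{
    pthread_mutex_init(&so->so_rcv.sb_mtx, NULL);
    pthread_mutex_init(&so->so_snd.sb_mtx, NULL);
    so->so_rcv.sb_cc = 0;
    so->so_snd.sb_cc = 0;
    so->so_state = 0;
}

/* The socket lock is not a third mutex: it aliases the receive buffer's. */
static pthread_mutex_t *sock_mtx(struct socket_model *so)
{
    return &so->so_rcv.sb_mtx;
}

static void sock_lock(struct socket_model *so)
{
    int rc = pthread_mutex_lock(sock_mtx(so));
    assert(rc == 0);
    (void)rc;
}

static void sock_unlock(struct socket_model *so)
{
    int rc = pthread_mutex_unlock(sock_mtx(so));
    assert(rc == 0);
    (void)rc;
}

/*
 * Inbound delivery updates the receive buffer and a socket field under a
 * single acquisition, because both are covered by the same mutex.
 */
static void sock_deliver(struct socket_model *so, unsigned int len, int bits)
{
    sock_lock(so);
    so->so_rcv.sb_cc += len;
    so->so_state |= bits;
    sock_unlock(so);
}
```

    The send buffer keeps its own mutex, so a sender is not blocked by
    inbound delivery in this model.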

    It's obviously desirable to get things merged as soon as they are ready,
    even with Giant remaining over the stack, so that we can get broad
    exercising of the locking assertions in INVARIANTS and WITNESS. As such,
    over the next month I anticipate an increasing number of merges, and
    increasing usability of "debug.mpsafenet" in the main tree. Turning off
    Giant will likely lead to problems for some time to come, but the sooner
    we get exposure, the better life will be. We've done a lot of heavy
    testing of common code paths, but working out the edge cases will take
    some time. We're prepared to live in a world with a dual-mode stack for
    some period, but that has to be an interim measure.

    So I guess the upshot is "Stuff is going on, be aware, volunteer to
    help!".

    Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
    robert@fledge.watson.org Senior Research Scientist, McAfee Research

    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

