Coming soon: default to Giant-free networking in 6.x (was: Running the network stack without Giant -- change in default coming (fwd))

From: Robert Watson (rwatson_at_FreeBSD.org)
Date: 08/27/04

    Date: Fri, 27 Aug 2004 17:51:25 -0400 (EDT)
    To: current@FreeBSD.org
    
    

    This evening I will commit several changes to the 6.x branch relating to
    network stack locking. They will include the following:

    - Change in the default setting for Giant over the network stack. The
      practical impact is that the flag debug.mpsafenet will change from a
      default value of 0 to a default value of 1.

    - Infrastructure to allow klds and other components to declare a
      dependency on the use of Giant over the network stack, resulting in a
      boot-time (or load-time) warning about possible unsafe operation.

    - If a module/component is present from early in the boot and requires
      Giant, it will restore the default setting of debug.mpsafenet to 0 from
      1, and generate a warning indicating a degraded mode of operation is in
      effect.

    - Addition of a "NET_WANT_GIANT" option to the kernel configuration to
      allow the default setting of debug.mpsafenet to be set as a compile-time
      property of a kernel configuration. You will still be able to override
      that setting using debug.mpsafenet, but it will restore the current
      status quo for a system without debug.mpsafenet explicitly set.
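
    As a rough illustration of how the compile-time option and the boot-time
    tunable described above combine, here is a minimal shell sketch. The
    function name effective_mpsafenet and its arguments are hypothetical, and
    the sketch deliberately leaves out the case of a Giant-dependent module
    present at boot, since the message does not spell out how that interacts
    with an explicitly set tunable.

```shell
#!/bin/sh
# effective_mpsafenet WANT_GIANT TUNABLE   (hypothetical helper)
#   WANT_GIANT: "yes" if the kernel was built with the NET_WANT_GIANT option
#   TUNABLE:    value of debug.mpsafenet set at boot, or "" if not set
# Prints the value debug.mpsafenet would end up with under the rules above.
effective_mpsafenet() {
    want_giant=$1
    tunable=$2
    if [ -n "$tunable" ]; then
        # An explicit debug.mpsafenet setting overrides the compiled-in default.
        echo "$tunable"
    elif [ "$want_giant" = "yes" ]; then
        # NET_WANT_GIANT restores the old default: Giant over the stack.
        echo 0
    else
        # New default after the change: Giant-free networking.
        echo 1
    fi
}
```

    For example, "effective_mpsafenet yes 1" prints 1: the kernel was built
    with NET_WANT_GIANT, but the explicit tunable still wins.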

    Modules/components that will declare a dependency on Giant are currently
    limited to:

    - KAME IPSEC, which currently does not have sufficient locking to operate
      correctly without Giant. FAST_IPSEC is able to run without Giant, but
      does not support IPv6. I'm currently exploring locking for KAME IPSEC,
      but it's a non-trivial task.

    - Netgraph "tty" module, which interacts with the TTY subsystem from the
      network context. The TTY subsystem currently requires Giant. I have
      not yet begun exploring how to address this issue.

    There are components of IPv6 that are not MPSAFE, but they are
    sufficiently minor components that they will be relatively safe in most
    scenarios, and it's probably reasonable to leave them enabled by default.
    We're working on completing locking for those components.

    I will post a HEADS UP post when the changes are complete and merged, and
    provide details regarding how these changes should affect choices in
    kernel configuration.

    Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
    robert@fledge.watson.org Principal Research Scientist, McAfee Research

    ---------- Forwarded message ----------
    Date: Tue, 24 Aug 2004 10:30:43 -0400 (EDT)
    From: Robert Watson <rwatson@FreeBSD.org>
    To: current@FreeBSD.org
    Subject: Running the network stack without Giant -- change in default coming

    For some time, one of the major goals of the FreeBSD Project has been to
    allow the network stack to run in parallel on multiple processors at a
    time. Per my July 19, 2004 post to the freebsd-current mailing list, much
    of this support has now been merged to the FreeBSD 5-CURRENT branch (and
    now 6-CURRENT), with the intent of shipping this support in 5.3. And, per
    that post, it's now possible to run large parts of the network stack in
    this manner through the use of a system tunable at boot, debug.mpsafenet.
    This can result in a variety of performance benefits, especially on SMP,
    by improving concurrency and reducing latency. While it presents a "first
    cut" locking strategy, these benefits are still pretty tangible, and the
    resulting system is an excellent starting architecture for a broad range
    of performance work.

    Right now, that tunable "debug.mpsafenet" defaults to off (0) in the
    5-CURRENT and 6-CURRENT branches. However, this will shortly change in
    6-CURRENT to on (1), as most commonly exercised parts of the network stack
    are now ready for testing in this environment. Some caveats before I go
    into the details as to how to determine whether this is right for you:

    - While we've been doing pretty heavy testing in MPSAFE configurations,
      the nature of multiprocessor development and adapting code for MP safety
      means that it's unlikely this will "just work" for every last person who
      tries it. However, it appears to work well in a broad variety of
      environments and with fairly strenuous testing.

    - We've focussed primarily on getting mainstream network configurations to
      run without Giant: this means that less mainstream subsystems (parts of
      IPv6, some netgraph nodes, IPX, etc) are currently unsafe without the
      Giant lock turned on. Less mainstream network devices, even if their
      device drivers are not able to run without the Giant lock, are able to
      operate without Giant over the remainder of the stack due to
      compatibility code. This code comes with a performance penalty beyond
      just running with the Giant lock, so there is a strong motivation to
      complete locking for these straggling drivers.

    - You may run into hard to diagnose problems. We'd like to try to
      diagnose them anyway, but if you start to experience new problems,
      you'll want to go read the Handbook chapter on preparing kernel bug
      reports and diagnosing problems. You'll also want to be prepared to run
      the system with INVARIANTS and WITNESS turned on. The first step in
      debugging will be to try running with Giant turned back on by changing
      the debug.mpsafenet flag and seeing if the problem can be reproduced.
      Details below.

    - Not all workloads will experience a performance benefit -- some, for
      various reasons, will get worse. However, several interesting
      performance loads get measurably better. If you don't see an
      improvement, or you see things get worse, please don't be surprised --
      you may want to look at some of the suggestions I make below on ways to
      make the results more predictable. Generally, you shouldn't see
      substantial performance degradation, if any, but it can't be ruled out,
      especially due to outstanding scheduler issues that are being worked on.

    - We can and will destroy your data. We don't mean to, because we like
      your data (and you!), and we try not to, but this is, after all,
      operating system development, and comes with risks.

    With this in mind, now is a good time to increase exposure for these
    changes, because they will become the default in the near future.

    Here's some technical information on how to get started:

    (1) Determine if all of the stack components you will operate with are
        MPSAFE. For common configurations, answering the following questions
        will help you decide this:

            - Are you actively using IPv6, IPX, ATM, or KAME IPSEC? If you
              answered yes to any of these questions, it is not yet safe for
              you to run without Giant. Note that most use of IPv6 is safe,
              but there are some areas (multicast) that are not entirely safe
              yet.

            - Are you using Netgraph? If yes, it may be that you are not yet
              able to run without Giant. The framework and many nodes are
              MPSAFE, but some remain that are not. It is worth giving it a
              try, but you may experience panics, etc, especially in MP
              configurations.

            - Are you using SLIP or kernel PPP (not to be confused with user
              ppp, which is what most FreeBSD users use with modems)? If so,
              there are experimental patches to make SLIP safe, but out of the
              box you may see lock assertion failures. We are working to
              resolve this issue.

            - Are you using any physical network interfaces other than the
              following: ath, bge, dc, em, ep, fxp, rl, sis, xl, wi? If so,
              you may see a performance drop.

              NOTE: Do you maintain a network interface driver? Is it not on
              this list? Shame on you! Or maybe shame on me for not listing
              it, even though it should work. Drop me a private e-mail with
              any questions or comments. Please update the busdma driver
              status web page with your driver's status.

    (2) If you are comfortable that you are using an MPSAFE-supported
        configuration, then you can use the following tunable in loader.conf
        to disable the Giant lock over the network stack on your system:

            debug.mpsafenet="1"

        Note that this is a boot-time only flag; you can inspect the setting
        with a sysctl, but it cannot currently be changed at runtime. You
        will need to reboot for the change to take effect.

        Once the default has changed, it will be necessary to explicitly
        disable Giant-free networking if that is the desired operating mode.
        Specifically, you will need to place the following in loader.conf to
        get that mode of operation:

            debug.mpsafenet="0"
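
    The two steps above can be partly mechanised. The following is a minimal
    shell sketch, not a definitive tool: the helper names driver_is_listed and
    loader_mpsafenet are hypothetical, the driver list is copied from the
    message, interface names are assumed to be a driver name followed by a
    unit number, and only a single loader.conf-style file is examined, with
    the assumption that the last assignment in it is the one that counts.

```shell
#!/bin/sh
# Drivers the message lists as able to run without Giant.
MPSAFE_DRIVERS="ath bge dc em ep fxp rl sis xl wi"

# driver_is_listed IFNAME   (hypothetical helper)
# Succeeds if the interface's driver is on the list above. Assumes interface
# names are a driver name followed by a unit number, e.g. em0 or fxp1.
driver_is_listed() {
    drv=$(printf '%s\n' "$1" | sed 's/[0-9]*$//')
    for d in $MPSAFE_DRIVERS; do
        if [ "$d" = "$drv" ]; then
            return 0
        fi
    done
    return 1
}

# loader_mpsafenet FILE   (hypothetical helper)
# Prints the last debug.mpsafenet value (0 or 1) assigned in FILE, or nothing
# if the tunable is not set there. Commented-out lines are ignored.
loader_mpsafenet() {
    sed -n 's/^[[:space:]]*debug\.mpsafenet="\{0,1\}\([01]\)"\{0,1\}[[:space:]]*$/\1/p' "$1" |
        tail -n 1
}
```

    On a FreeBSD system, each name printed by "ifconfig -l" could be passed
    to driver_is_listed; pseudo-interfaces such as lo0 will be reported as
    not listed and can be ignored. The value actually in effect on a running
    system is shown by "sysctl debug.mpsafenet".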

    Some notes:

    On SMP-centric performance measurements, such as local UNIX domain socket
    use by MySQL on MP systems, I've observed 30%-40% performance improvements
    by disabling Giant (some details below). My recommended configuration for
    testing out the impact of disabling Giant on MP systems is:

    - Running with adaptive mutexes (now the default) and with ADAPTIVE_GIANT
      (also now the default) appears to make a big difference.

    - Try disabling HTT. In my workloads, which tend to pound the kernel,
      HTT appears to hurt quite a bit. Obviously, the effectiveness of HTT
      depends on the instruction mix, so this may not be for you. Builds, for
      example, may benefit.

    - Pick one of ULE and 4BSD, and then try the other. I found 4BSD helped a
      lot for MySQL, but I've seen other benchmarks with quite different
      results.

    - For stability purposes with MySQL, I currently have to disable
      PREEMPTION (currently the default), as the MySQL benchmarks I use are
      pretty thread-centric and trigger preemption-related bugs with the
      kernel threading bits. Recent work-arounds committed should resolve
      this but I have not yet run stability tests.

    - If you want to measure performance, make sure to disable INVARIANTS,
      INVARIANTS_SUPPORT, WITNESS, etc. Also, confirm that the userland
      malloc debugging features are disabled, as they add cost to each free()
      operation. I believe we now have a handbook with a variety of
      recommendations on performance measurement, such as disabling various
      daemons (such as dhclient, etc). For latency measurements, PREEMPTION
      is generally desired, subject to stability.

    - To increase parallelism, especially for inbound packet paths on multiple
      interfaces, set the sysctl/tunable net.isr.enable=1, which enables
      direct dispatch in network interface ithreads, rather than deferring to
      the netisr thread. If each interface is assigned a different ithread,
      their inbound processing paths can run in parallel, as well as with
      loopback traffic running in the global netisr thread. We have additional
      work to do here in terms of increasing the chances of parallel dispatch,
      etc, and it could be that in some environments this is not a useful
      setting.
      I'd be interested in learning about the environments where a negative
      performance impact is measured.
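
    Before taking performance numbers, the advice above about disabling
    INVARIANTS, INVARIANTS_SUPPORT, and WITNESS can be checked against a
    kernel configuration file. A minimal sketch, assuming the usual
    "options NAME" line syntax of kernel configuration files; the helper name
    debug_options_in is hypothetical, and the "etc." in the list above is not
    covered:

```shell
#!/bin/sh
# debug_options_in KERNCONF   (hypothetical helper)
# Prints each of the debugging options named in the message that is enabled
# (present and not commented out) in the given kernel configuration file.
debug_options_in() {
    for opt in INVARIANTS INVARIANTS_SUPPORT WITNESS; do
        # Match "options NAME" followed by whitespace, a comment, or end of
        # line, so that INVARIANTS does not also match INVARIANTS_SUPPORT.
        if grep -Eq "^[[:space:]]*options[[:space:]]+${opt}([[:space:]]|#|\$)" "$1"; then
            echo "$opt"
        fi
    done
}
```

    An empty result for the configuration the benchmark kernel was built from
    suggests none of the three options will distort the measurements.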

    Some notes on bug reporting:

    - Make sure to identify that you are running with debug.mpsafenet on. If
      the problem is reproducible, make sure to indicate if it goes away or
      persists when you disable debug.mpsafenet. This will help to
      distinguish network stack problems which are (and are not) a result of
      this work.

    - If you appear to be experiencing a hang/deadlock, please try running
      with WITNESS. I'd actually like to see most people running with WITNESS
      for a bit to shake out lock order issues, as I've introduced a lot of
      orders. If experiencing lock order reversals, please include the full
      console warning including stack trace and any warning messages prior to
      the trace identifying locks, etc. If dropped to DDB, "show locks" is
      useful.

    - INVARIANTS is also considered good. Even if you aren't running with
      WITNESS, do run with INVARIANTS. Note that there is a measurable
      performance hit for doing so.

    - If you experience a hang, see if you can get into DDB -- if you are
      having problems getting in using a console break, try a serial console.
      When debugging, please capture at minimum the DDB 'ps' output, along
      with traces of interesting processes. Typically, the interesting ones
      are processes that appear to be involved in the hang. Obviously, this
      requires some intuition about what causes the hang and I can't offer
      hard and fast rules here. NMI, SW_WATCHDOG, and MP_WATCHDOG can all
      increase the chances of getting to DDB even in hard hangs.

    - Experimenting with debug.mpsafenet=1 and UP is also interesting, not
      just SMP. With PREEMPTION turned on, it may result in lower latency
      and/or lower throughput. Or not. Regardless, it's interesting -- you
      don't have to have SMP to give it a spin.

    FYI, while results can and will vary, I was pleased to observe moving from
    a UP->MP speedup of 1.07 on a dual-processor box to a speedup of 1.42 with
    the supersmack benchmark using 11 workers and 1000 select transactions
    with MySQL. For reference, that was with the 4BSD scheduler and adaptive
    mutexes. For loopback netperf with TCP and UDP, I observed no change in
    performance (well, 1% better for UDP RR, but basically no change). Note
    that the MySQL benchmark here is basically a UNIX domain socket IPC test,
    and so real world databases will give pretty different results since they
    won't be pure IPC. The results appear to be very sensitive to the choice
    of scheduler, and for a variety of reasons I've preferred 4BSD during
    recent testing (not least, better results in terms of throughput).
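
    To put the two speedup figures above on a common footing, the relative
    gain can be worked out directly. This is a small sketch, assuming the
    uniprocessor baseline is the same in both measurements (the message does
    not say so explicitly); the helper name relative_gain is hypothetical.

```shell
#!/bin/sh
# relative_gain BEFORE AFTER   (hypothetical helper)
# Prints the percentage improvement of AFTER over BEFORE, to one decimal.
relative_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after / before - 1) * 100 }'
}
```

    Under that assumption, "relative_gain 1.07 1.42" prints 32.7, which falls
    within the 30%-40% range quoted earlier for MySQL over local UNIX domain
    sockets.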

    There are a lot of people who have been working on this for quite some
    time -- I can't thank them all here, but I will point at the netperf web
    page as a place to look for ongoing patches, change logs, and some
    credits:

        http://www.watson.org/~robert/freebsd/netperf/

    The hard work and contributions of these many developers over several
    years are finally coming to fruition! I try to keep the page up to date
    about once a week or so as I drop new patch sets. There's also an RSS
    feed on the change log, which is fairly technical but might be
    interesting to some readers.

    Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
    robert@fledge.watson.org Principal Research Scientist, McAfee Research

    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
