Re: Call for performance evaluation: net.isr.direct

From: Robert Watson (rwatson_at_FreeBSD.org)
Date: 10/11/05

    Date: Tue, 11 Oct 2005 15:01:11 +0100 (BST)
    To: performance@FreeBSD.org
    
    

    On Wed, 5 Oct 2005, Robert Watson wrote:

    > In 2003, Jonathan Lemon added initial support for direct dispatch of
    > netisr handlers from the calling thread, as part of his DARPA/NAI Labs
    > contract in the DARPA CHATS research program. Over the last two years
    > since then, Sam Leffler and I have worked to refine this implementation,
    > removing a number of ordering related issues, opportunities for
    > excessive parallelism, recursion issues, and testing with a broad range
    > of network components. There has also been a significant effort to
    > complete MPSAFE locking work throughout the network stack. Combined
    > with the earlier move to ithreads and a functional direct dispatch
    > ("process to completion" implementation), there are a number of exciting
    > possible benefits.

    If I don't hear anything back in the near future, I will commit a change
    to 7.x to make direct dispatch the default, in order to let a broader
    community do the testing. :-) If you are set up to easily test stability
    and performance relating to direct dispatch, I would appreciate any help.

    As of 6.0-RC1 and recent 7.x, the name of the sysctl is "net.isr.direct";
    previously it has been named "net.isr.enable", but its use is not
    recommended in versions that do not use the new name.

    Thanks,

    Robert N M Watson

    >
    > - Possible parallelism by packet source -- ithreads can dispatch
    > simultaneously into the higher level network stack layers. Since
    > ithreads can execute in parallel on different CPUs, so can code they
    > invoke directly.
    >
    > - Elimination of context switches in the network receive path -- rather
    > than context switching to the netisr thread from the ithread, we can now
    > directly execute netisr code from the ithread.
    >
    > - A CPU-bound netisr thread on a multi-processor system will no longer
    > rate limit traffic to the available resources on one CPU.
    >
    > - Eliminating the additional queueing in the handoff reduces the
    > opportunity for queues to overfill as a result of scheduling delays.
    >
    > There are, however, some possible downsides and/or trade-offs:
    >
    > - Higher level network processing will now compete with the interrupt
    > handler for CPU resources available to the ithread. This means less
    > time for the interrupt code to execute in the thread if the thread is
    > CPU-bound.
    >
    > - Lower levels of parallelism between portions of the inbound packet
    > processing path. Without direct dispatch, there is possible parallelism
    > between receive network driver execution and higher level stack layers,
    > whereas with direct dispatch they can no longer execute in parallel.
    >
    > - Re-queued packets from tunnel and encapsulation processing will now
    > require a context switch to process, since they will be processed in the
    > netisr proper rather than in the ithread, whereas before the netisr
    > thread would pick them up immediately after completing the current
    > processing without a context switch.
    >
    > - Code that previously ran in the SWI at a SWI priority now runs in the
    > ithread at an ithread priority, elevating the general priority at which
    > network processing takes place.
    >
    > And there are a few mixed things that offer both good and bad elements:
    >
    > - Less queueing takes place in the network stack in in-bound processing:
    > packets are taken directly from the driver and processed to completion
    > one by one, rather than queued for batch processing. Packets will be
    > dropped before the link layer, rather than on the boundary between the
    > link and protocol layers. This is good in that we invest less work in
    > packets we were going to drop anyway, but bad in that less queueing
    > means less room for scheduling delays.
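
As a rough illustration of the queueing trade-off in the lists above, here is a toy Python model contrasting the queued handoff with direct dispatch. It is not FreeBSD kernel code: the names ToyNetisr and burst, the queue_limit value, and the burst sizes are made up for illustration, and real netisr queue limits, locking, and scheduling are not modelled.

```python
from collections import deque


class ToyNetisr:
    """Toy model of queued vs. direct dispatch (hypothetical, not kernel code)."""

    def __init__(self, handler, direct, queue_limit=4):
        self.handler = handler          # stands in for a protocol input routine
        self.direct = direct            # True: run the handler in the caller
        self.queue = deque()            # handoff queue used in queued mode
        self.queue_limit = queue_limit  # illustrative limit, not a real default
        self.dropped = 0

    def dispatch(self, packet):
        """Called from the 'ithread' for each received packet."""
        if self.direct:
            self.handler(packet)        # process to completion, no handoff
        elif len(self.queue) < self.queue_limit:
            self.queue.append(packet)   # wait for the netisr thread to run
        else:
            self.dropped += 1           # queue overfilled before netisr ran

    def run_netisr(self):
        """Called when the 'netisr thread' finally gets scheduled."""
        while self.queue:
            self.handler(self.queue.popleft())


def burst(direct, count, queue_limit=4):
    """Deliver `count` packets before the netisr runs; return (processed, dropped)."""
    processed = []
    isr = ToyNetisr(processed.append, direct, queue_limit)
    for packet in range(count):
        isr.dispatch(packet)
    isr.run_netisr()
    return len(processed), isr.dropped
```

With a burst larger than the queue limit, the queued mode drops the excess at the handoff while the direct mode processes every packet in the caller. The model deliberately ignores the cost side: in direct mode all of that work is charged to the calling thread.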
    >
    > In previous FreeBSD releases, such as several 5.x series releases,
    > net.isr.enable could not be turned on by default because there was
    > insufficient synchronization in the network stack. As of 5.5 and 6.0, I
    > believe there is sufficient synchronization, especially given that we force
    > non-MPSAFE protocol handlers to run in the netisr without direct dispatch.
    > As such, there has been a gradual conversation going on about making direct
    > dispatch the default behavior in the 7.x development series, and more
    > publicly documenting and supporting the use of direct dispatch in the 6.x
    > release engineering series.
    >
    > Obviously, this is about two things: performance, and stability. Many of us
    > have been running with direct dispatch on by default for quite some time, so
    > it passes some of the basic "does it run" tests. However, since it
    > significantly increases the opportunity for parallelism in the receive path
    > of the network stack, it will likely cause otherwise latent or infrequent
    > races and bugs to occur more frequently. The second aspect is performance:
    > many results suggest that direct dispatch has a significant performance
    > benefit. However, evaluating the impact on a broad range of results is
    > required in order for us to go ahead with what is effectively a significant
    > architectural change in how we perform network stack processing.
    >
    > To give you a sense of some of the performance effect I've measured recently,
    > using the netperf measurement tool (with -DHISTOGRAM removed from the FreeBSD
    > port build), here are some results. In each case, I've put parentheses
    > around host or router to indicate which is the host where the configuration
    > change is being tested. These tests were performed using dual Xeon systems,
    > and using back-to-back gigabit ethernet cards and the if_em driver:
    >
    > TCP round trip benchmark (TCP_RR), host-(host):
    >
    > 7.x UP: 0.9% performance improvement
    > 7.x SMP: 0.7% performance improvement
    >
    > TCP round trip benchmark (TCP_RR), host-(router)-host:
    >
    > 7.x UP: 2.4% performance improvement
    > 7.x SMP: 2.9% performance improvement
    >
    > UDP round trip benchmark (UDP_RR), host-(host):
    >
    > 7.x UP: 0.7% performance improvement
    > 7.x SMP: 0.6% performance improvement
    >
    > UDP round trip benchmark (UDP_RR), host-(router)-host:
    >
    > 7.x UP: 2.2% performance improvement
    > 7.x SMP: 3.0% performance improvement
    >
    > TCP stream benchmark (TCP_STREAM), host-(host):
    >
    > 7.x UP: 0.8% performance improvement
    > 7.x SMP: 1.8% performance improvement
    >
    > TCP stream benchmark (TCP_STREAM), host-(router)-host:
    >
    > 7.x UP: 13.6% performance improvement
    > 7.x SMP: 15.7% performance improvement
    >
    > UDP stream benchmark (UDP_STREAM), host-(host):
    >
    > 7.x UP: none
    > 7.x SMP: none
    >
    > UDP stream benchmark (UDP_STREAM), host-(router)-host:
    >
    > 7.x UP: none
    > 7.x SMP: none
    >
    > TCP connect benchmark (src/tools/tools/netrate/tcpconnect)
    >
    > 7.x UP: 7.90383% +/- 0.553773%
    > 7.x SMP: 12.2391% +/- 0.500561%
    >
    > So in some cases, the impact is negligible -- in other places, it is quite
    > significant. So far, I've not measured a case where performance has gotten
    > worse, but that's probably because I've only been measuring a limited number
    > of cases, and with a fairly limited scope of configurations, especially given
    > that the hardware I have is pushing the limits of what the wire supports, so
    > minor changes in latency are possible, but not large changes in throughput.
    >
    > So other than a summary of the status quo, this is also a call to action. I
    > would like to get more widespread benchmarking of the impact of direct
    > dispatch on network-related workloads. This means a variety of things:
    >
    > (1) Performance of low level network services, such as routing, bridging,
    > and filtering.
    >
    > (2) Performance of high level application services, such as web and
    > database.
    >
    > (3) Performance of integrated kernel network services, such as the NFS
    > client and server.
    >
    > (4) Performance of user space distributed file systems, such as Samba and
    > AFS.
    >
    > All you need to do to switch to direct dispatch mode is set the sysctl or
    > tunable "net.isr.direct" to 1. To disable it again, remove the setting, or
    > set it to 0. It can be modified at run-time, although during the transition
    > from one mode to the other, there may be a small quantity of packet
    > misordering, so benchmarking over the transition is discouraged.
    > FYI: as of 6.0-RC1 and recent 7.0, net.isr.direct is the name of the
    > variable. In earlier releases, the name of this variable was net.isr.enable.
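
For anyone scripting test runs, a minimal Python sketch of the two ways to apply the setting follows. It only builds the sysctl(8) command line and a loader tunable line and does not execute anything. The helper names and the new_naming flag are hypothetical. It assumes the net.isr.direct name given for 6.0-RC1 and recent 7.x, with net.isr.enable as the earlier name, and that the tunable is set in /boot/loader.conf.

```python
def sysctl_name(new_naming=True):
    """Pick the variable name: net.isr.direct on 6.0-RC1 and recent 7.x,
    net.isr.enable on earlier releases (per this thread)."""
    return "net.isr.direct" if new_naming else "net.isr.enable"


def runtime_command(enable, new_naming=True):
    """argv for sysctl(8) to switch modes at run time (needs root to run)."""
    return ["sysctl", f"{sysctl_name(new_naming)}={1 if enable else 0}"]


def loader_conf_line(enable, new_naming=True):
    """Line for /boot/loader.conf to set the tunable at boot (assumed location)."""
    return f'{sysctl_name(new_naming)}="{1 if enable else 0}"'
```

Passing the result of runtime_command(True) to subprocess.run on a test machine would flip the mode. Since packets may be misordered during the transition, do this before a measurement run starts, not during it.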
    >
    > Some important details:
    >
    > - Only non-local protocol traffic is affected: loopback traffic still goes
    > via the netisr to avoid issues of recursion and lock order.
    >
    > - In the general case, only in-bound traffic is directly affected by this
    > change. As such, send-only benchmarks may reveal little change. They
    > are still interesting, however.
    >
    > - However, the send path is indirectly affected due to changes in
    > scheduling, workload, interrupt handling, and so on.
    >
    > - Because network benchmarks, especially micro-benchmarks, are especially
    > sensitive to minor perturbations, I highly recommend running in a
    > minimal multi-user or ideally single-user environment, and suggest
    > isolating undesired sources of network traffic from segments where
    > testing is occurring. For macro-benchmarks this can be less important,
    > but should be paid attention to.
    >
    > - Please make sure debugging features are turned off when running tests --
    > especially WITNESS, INVARIANTS, INVARIANT_SUPPORT, and user space malloc
    > debugging. These can have a significant impact on performance, both
    > potentially overshadowing changes, and in some cases, actually reversing
    > results (due to higher overhead under locks, for example).
    >
    > - Do not use net.isr.enable in the 5.x line unless you know what you are
    > doing. While it is reasonably safe with 5.4 forwards, it is not a
    > supported configuration, and may cause stability issues with specific
    > workloads.
    >
    > - What we're particularly interested in is a statistically meaningful
    > comparison of the "before" and "after" case. When doing measurements, I
    > like to run 10-12 samples, and usually discard the first one or two,
    > depending on the details of the benchmark. I'll then use
    > src/tools/tools/ministat to compare the data sets. Running a number of
    > samples is quite important, because the variance in many tests can be
    > significant, and if the two sample sets overlap, you can quite easily
    > draw the entirely wrong conclusion about the results from a small number
    > of measurements in a sample.
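
The sample handling described in the last item can be sketched in Python as below: drop the first one or two warm-up samples and write the remainder one value per line, which is the kind of single-column input ministat can read. The function name, the discard default, and the minimum-sample check are assumptions for illustration, not part of any existing tool.

```python
def ministat_input(samples, discard=2):
    """Drop the first `discard` warm-up samples and format the remainder
    as one value per line."""
    kept = list(samples)[discard:]
    if len(kept) < 3:
        # Too few points left for a meaningful comparison (illustrative check).
        raise ValueError("not enough samples left after discarding warm-up runs")
    return "".join(f"{value}\n" for value in kept)
```

Writing the returned text to one file per configuration (queued and direct) gives two data sets that can then be compared with ministat.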
    >
    > Assuming you have a fixed width font, typical output from ministat looks
    > something like the following and may be human readable:
    >
    > x 7SMP/tcpconnect_queue
    > + 7SMP/tcpconnect_direct
    > +--------------------------------------------------------------------------+
    > |x      xx                                                          +     +|
    > |xxxxx  xx                                                       ++ +++++ +|
    > ||__A__|                                                          |___A__| |
    > +--------------------------------------------------------------------------+
    >     N           Min           Max        Median           Avg        Stddev
    > x  10          5425          5503          5460        5456.3     26.284977
    > +  10          6074          6169          6126        6124.1     31.606785
    > Difference at 95.0% confidence
    >         667.8 +/- 27.3121
    >         12.2391% +/- 0.500561%
    >         (Student's t, pooled s = 29.0679)
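
To make the "Difference at 95.0% confidence" lines less opaque, here is a small Python sketch of the standard pooled-variance Student's t calculation, applied to the N, Avg, and Stddev figures printed above. It is a reconstruction of the textbook formula, not ministat's source. The function name is hypothetical, and the critical value 2.101 (two-sided 95%, 18 degrees of freedom) is taken from a standard t table.

```python
import math

# Two-sided 95% Student's t critical value for 18 degrees of freedom
# (n1 + n2 - 2 with ten samples per set), from a standard t table.
T_95_DF18 = 2.101


def pooled_difference(n1, avg1, sd1, n2, avg2, sd2, t_crit):
    """Return (difference, error, pooled_s, percent, percent_error) for two
    sample sets summarised by count, mean, and standard deviation."""
    pooled_s = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    diff = avg2 - avg1
    err = t_crit * pooled_s * math.sqrt(1.0 / n1 + 1.0 / n2)
    # Percentages are relative to the first ("before") data set.
    return diff, err, pooled_s, 100.0 * diff / avg1, 100.0 * err / avg1


# Summary figures for the queue (x) and direct (+) data sets shown above.
diff, err, pooled_s, pct, pct_err = pooled_difference(
    10, 5456.3, 26.284977, 10, 6124.1, 31.606785, T_95_DF18
)
print(f"{diff:.1f} +/- {err:.4f}")
print(f"{pct:.4f}% +/- {pct_err:.6f}%")
print(f"pooled s = {pooled_s:.4f}")
```

Up to rounding, this should agree with the difference, percentage, and pooled s values in the ministat output. If the error term is larger than the difference itself, the two configurations cannot be distinguished at that confidence level.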
    >
    > Of particular interest is if changing to direct dispatch hurts performance in
    > your environment, and understanding why that is.
    >
    > Thanks,
    >
    > Robert N M Watson
    > _______________________________________________
    > freebsd-performance@freebsd.org mailing list
    > http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    > To unsubscribe, send any mail to
    > "freebsd-performance-unsubscribe@freebsd.org"
    >

