Re: Advice on a multithreaded netisr patch?

--- On Sun, 4/5/09, Robert Watson <rwatson@xxxxxxxxxxx> wrote:

From: Robert Watson <rwatson@xxxxxxxxxxx>
Subject: Re: Advice on a multithreaded netisr patch?
To: "Ivan Voras" <ivoras@xxxxxxxxxxx>
Cc: freebsd-net@xxxxxxxxxxx
Date: Sunday, April 5, 2009, 9:54 AM
On Sun, 5 Apr 2009, Ivan Voras wrote:

I thought this has something to do with NIC
moderation (em) but can't really explain it. The bad
performance part (not the jump) is also visible over the
loopback interface.

FYI, if you want high performance, you really want
a card supporting multiple input queues -- igb, cxgb, mxge,
etc. if_em-only cards are fundamentally less scalable in an
SMP environment because they require input or output to
occur only from one CPU at a time.

Makes sense, but on the other hand - I see people are
routing at least 250,000 packets per second per direction
with these cards, so they probably aren't the bottleneck
(pro/1000 pt on pci-e).

The argument is not that they are slower (although they
probably are a bit slower), rather that they introduce
serialization bottlenecks by requiring synchronization
between CPUs in order to distribute the work. Certainly
some of the scalability issues in the stack are not a result
of that, but a good number are.

Historically, we've had a number of bottlenecks in,
say, the bulk data receive and send paths, such as:

- Initial receipt and processing of packets on a single CPU as a result of a
  single input queue from the hardware. Addressed by using multiple input
  queue hardware with appropriately configured drivers (generally the default
  is to use multiple input queues in 7.x and 8.x for supporting hardware).

- Cache line contention on stats data structures in drivers and various levels
  of the network stack due to bouncing around exclusive ownership of the cache
  line. ifnet introduces at least a few, but I think most of the interesting
  ones are at the IP and TCP layers for receipt.

- Global locks protecting connection lists, all rwlocks as of 7.1, but not
  necessarily always used read-only for packet processing. For UDP we do a
  very good job at avoiding write locks, but for TCP in 7.x we still use a
  global write lock, if briefly, for every packet. There's a change in 8.x to
  use a global read lock for most packets, especially steady state packets,
  but I didn't merge it for 7.2 because it's not well-benchmarked. Assuming I
  get positive feedback from more people, I will merge it before 7.3.

- If the user application is multi-threaded and receiving from many threads at
  once, we see contention on the file descriptor table lock. This was
  markedly improved by the file descriptor table locking rewrite in 7.0, but
  we're continuing to look for ways to mitigate this. A lockless approach
  would be really nice...
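
As an illustration only (this is not code from the thread or from FreeBSD), one common way to reduce the stats cache line contention described in the list above is to give each CPU its own counter slot and sum the slots only when the value is read. The Python sketch below is a conceptual model: `ShardedCounter`, `add`, and `total` are hypothetical names, and Python does not model cache lines, so it shows the access pattern rather than the performance effect.

```python
class ShardedCounter:
    """Conceptual per-CPU counter: one slot per CPU, summed on read.

    Hypothetical illustration; the names and layout are assumptions,
    not a FreeBSD kernel interface.
    """

    def __init__(self, ncpus):
        if ncpus < 1:
            raise ValueError("need at least one CPU slot")
        # Each CPU only ever writes its own slot, so in a real kernel
        # (with padded slots) no cache line is shared between writers.
        self.slots = [0] * ncpus

    def add(self, cpu, n=1):
        # Hot path: touch only the slot owned by the calling CPU.
        self.slots[cpu] += n

    def total(self):
        # Cold path: a reader pays the cost of visiting every slot.
        return sum(self.slots)


ipackets = ShardedCounter(ncpus=4)
ipackets.add(cpu=0)
ipackets.add(cpu=3, n=2)
print(ipackets.total())
```

The trade-off is that reads become more expensive and only approximately consistent, which is usually acceptable for statistics.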

On the transmit path, the bottlenecks are similar but
different:

- Neither 7.x nor 8.x supports multiple transmit queues as shipped; Kip has
  patches for both that add it for cxgb. Maintaining ordering here, and
  ideally affinity to the appropriate associated input queue, is important.
  As the patches aren't in the tree yet, or for single-queue drivers,
  contention on the device driver send path and queues can be significant,
  especially for device drivers where the send and receive path are protected
  by the same lock (bge!).
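
As an illustration only (not Kip's patches and not FreeBSD code), the ordering requirement for multiple transmit queues can be met by deriving a flow identifier from the connection tuple and always mapping the same flow to the same queue. In this minimal Python sketch, `flow_id` and `select_queue` are hypothetical helpers, and CRC32 stands in for whatever hash the hardware or driver actually uses.

```python
import socket
import struct
import zlib


def flow_id(src_ip, src_port, dst_ip, dst_port, proto=6):
    """Hash the connection tuple to a 32-bit flow identifier.

    CRC32 is an assumption made for illustration; real NICs and
    drivers use their own hash functions.
    """
    key = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    return zlib.crc32(key)


def select_queue(flowid, nqueues):
    """Map a flow identifier to a queue index in [0, nqueues)."""
    return flowid % nqueues


# Every packet of one connection lands on the same queue, so packets
# within a flow cannot be reordered by going through different queues.
fid = flow_id("10.0.0.1", 40000, "10.0.0.2", 80)
print(select_queue(fid, nqueues=4))
```

If the receive side applied the same mapping, a flow's transmit queue would match its input queue, which is the affinity mentioned in the list above.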


I'm curious about your assertion that hardware transmit queues are a
big win. You're really just loading a transmit ring well ahead of
actual transmission; there's no need to force a "start" for each
packet queued. You then have more overhead managing the multiple
queues: more memory used, more CPU cache needed, more interrupts
(perhaps), and the overhead of generating the flowid. It seems to me
that a more efficient method of transmitting, such as offloading the
transmit workload to a kernel task, would be more effective than using
multiple transmit queues. All the source thread has to do is queue
the packet and get out.
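
As an illustration only, the "queue the packet and get out" idea might look like the following user-space Python sketch. It is not kernel code: `DeferredTransmitter` and the `ring_write` callback are hypothetical names, and a Python thread stands in for the kernel task. Senders only enqueue; a single worker drains the queue in batches and hands each batch to the callback that would load the transmit ring.

```python
import queue
import threading


class DeferredTransmitter:
    """Hand packets to one worker thread that loads the ring in batches.

    Hypothetical illustration; ring_write is an assumed callback that
    accepts a list of packets.
    """

    def __init__(self, ring_write, batch=32):
        self._q = queue.Queue()
        self._ring_write = ring_write
        self._batch = batch
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, pkt):
        # Source thread: enqueue and return immediately.
        self._q.put(pkt)

    def close(self):
        # None is a sentinel telling the worker to finish and exit.
        self._q.put(None)
        self._worker.join()

    def _run(self):
        while True:
            pkt = self._q.get()  # block until there is work
            if pkt is None:
                return
            batch, stop = [pkt], False
            # Opportunistically gather more packets without blocking.
            while len(batch) < self._batch:
                try:
                    nxt = self._q.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self._ring_write(batch)
            if stop:
                return


sent = []
tx = DeferredTransmitter(sent.extend, batch=4)
for i in range(10):
    tx.send(i)
tx.close()
print(sent)
```

Note that every sender still synchronizes on the one shared queue, so this design moves the contention point rather than removing it.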

As an aside, why is Kip doing development on a Chelsio card rather
than a more mainstream product such as Intel or Broadcom that would
generate more widespread interest?

Barney





