Advice on a multithreaded netisr patch?



Hi,

I'm developing an application that needs a high rate of small TCP
transactions on multi-core systems, and I'm hitting a limit where a
kernel task, usually swi:net (but it depends on the driver), hits 100%
of a CPU at some transaction rate and blocks any further performance
increase even though the other cores are 100% idle.

So I've got an idea and tested it out, but it fails in an unexpected
way. I'm not very familiar with the network code so I'm probably missing
something obvious. The idea was to locate where the packet processing
takes place and offload packets to several new kernel threads. I see
this can happen in several places - netisr, ip_input and tcp_input, and
I chose netisr because I thought maybe it would also help other uses
(routing?). Here's a patch against CURRENT:

http://people.freebsd.org/~ivoras/diffs/mpip.patch

It's fairly simple - starts a configurable number of threads in
start_netisr(), assigns circular queues to each, and modifies what I
think are entry points for packets in the non-netisr.direct case. I also
try to have TCP and UDP traffic from the same host+port processed by the
same thread. It has some rough edges but I think this is enough to test
the idea. I know that there are several people officially working in
this area and I'm not an expert in it so think of it as a weekend hack
for learning purposes :)
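
To make the dispatch idea concrete, here is a minimal userland sketch of
the two pieces described above: picking a worker thread from the remote
host+port so that one flow always lands on the same thread, and a
fixed-size circular queue per thread. It is not taken from the patch.
The names (flow_key, pkt_ring, pick_thread, RING_SLOTS) and the hash
function are assumptions made for illustration, and the queue has no
locking, so a real kernel version would need a mutex or similar around
enqueue/dequeue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 256  /* hypothetical per-thread queue depth (power of two) */

/* Hypothetical flow key: the fields that should map to one thread. */
struct flow_key {
    uint32_t src_addr;  /* remote host (IPv4) */
    uint16_t src_port;  /* remote port */
    uint8_t  proto;     /* e.g. 6 for TCP, 17 for UDP */
};

/* One circular queue per worker thread; packets are opaque pointers here. */
struct pkt_ring {
    void    *slot[RING_SLOTS];
    uint32_t head;      /* count of packets dequeued so far */
    uint32_t tail;      /* count of packets enqueued so far */
};

/* Generic 32-bit integer mixer; any reasonable hash would do here. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

/* Same host+port+protocol always yields the same thread index. */
static unsigned pick_thread(const struct flow_key *k, unsigned nthreads)
{
    assert(nthreads > 0);
    uint32_t h = mix32(k->src_addr ^ ((uint32_t)k->src_port << 16) ^ k->proto);
    return h % nthreads;
}

/* Returns 0 on success, -1 if the ring is full (caller would drop). */
static int ring_enqueue(struct pkt_ring *r, void *pkt)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;
    r->slot[r->tail % RING_SLOTS] = pkt;
    r->tail++;
    return 0;
}

/* Returns the oldest queued packet, or NULL if the ring is empty. */
static void *ring_dequeue(struct pkt_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    void *pkt = r->slot[r->head % RING_SLOTS];
    r->head++;
    return pkt;
}
```

The input path would call pick_thread() on each packet's headers and
ring_enqueue() on that thread's ring. Note that reading the headers just
to compute the hash touches the packet before the thread that actually
processes it does.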

These parameters are needed in loader.conf to test it:

net.isr.direct=0
net.isr.mtdispatch_n_threads=2

I expected things like contention in the upper layers (TCP) to keep
performance from improving at all, but I can't explain what I'm getting
here. While testing the application on a plain kernel, I get approx.
100,000 - 120,000 packets/s per direction (as shown by "netstat 1")
and a similar number of transactions/s in the application. With the
patch I get up to 250,000 packets/s in netstat (3 mtdispatch threads),
but for some weird reason the actual number of transactions processed by
the application drops to less than 1,000/s at the beginning (for about
30 seconds), then jumps to close to 100,000 transactions/s, with netstat
also showing a drop to about this number of packets/s. In the first
phase, the new threads (netd0..3) are using almost 100% of CPU time; in
the second phase I can't see where the CPU time is going (using top).

I thought this had something to do with interrupt moderation on the NIC
(em), but I can't really explain it. The bad performance part (not the
jump) is also visible over the loopback interface.

Any ideas?



