Re: Advice on a multithreaded netisr patch?
- From: Ivan Voras <ivoras@xxxxxxxxxxx>
- Date: Mon, 06 Apr 2009 14:35:33 +0200
Robert Watson wrote:
On Mon, 6 Apr 2009, Ivan Voras wrote:
So, a mbuf can reference data not yet copied from the NIC hardware?
I'm specifically trying to understand what m_pullup() does.
I think we're talking slightly at cross purposes. There are two
transfers of interest:
(1) DMA of the packet data to main memory from the NIC
(2) Servicing of CPU cache misses to access data in main memory
By the time you receive an interrupt, the DMA is complete, so once you
OK, this was what was confusing me - for a moment I thought you meant
it's not so.
believe a packet referenced by the descriptor ring is done, you don't
have to wait for DMA. However, the packet data is in main memory rather
than your CPU cache, so you'll need to take a cache miss in order to
retrieve it. You don't want to prefetch before you know the packet data
is there, or you may prefetch stale data from the previous packet sent
or received from the cluster.
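
A minimal sketch of that ordering, assuming a hypothetical receive descriptor with a "done" status bit. The names rx_desc, RX_DESC_DONE and rx_ring_poll are invented for illustration and are not taken from the em driver. The payload is prefetched only after the descriptor reports that DMA has finished; __builtin_prefetch is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_DESC_DONE 0x1u   /* hypothetical "DMA complete" status bit */

/* Hypothetical receive descriptor; real NIC layouts differ. */
struct rx_desc {
    volatile uint32_t status;   /* written by the device when DMA finishes */
    void *buf;                  /* packet data in main memory */
};

/*
 * Walk the ring from 'head' and prefetch the payload of every descriptor
 * the device has marked done. Stops at the first descriptor that is not
 * done, so data that may be stale is never prefetched.
 * Returns the number of completed descriptors found.
 */
static size_t
rx_ring_poll(struct rx_desc *ring, size_t nring, size_t head)
{
    size_t n;

    assert(ring != NULL && nring > 0 && head < nring);
    for (n = 0; n < nring; n++) {
        struct rx_desc *d = &ring[(head + n) % nring];

        if ((d->status & RX_DESC_DONE) == 0)
            break;                      /* DMA not finished: do not touch */
        __builtin_prefetch(d->buf);     /* start the cache fill early */
    }
    return (n);
}
```

The prefetch only hides the cache miss if there is other work to do before the packet data is actually read.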
m_pullup() has to do with mbuf chain memory contiguity during packet
processing. The usual usage is something along the following lines:
    struct whatever *w;

    m = m_pullup(m, sizeof(*w));
    if (m == NULL)
        return;
    w = mtod(m, struct whatever *);
m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are
contiguously stored so that the cast of w to m's data will point at a
So, m_pullup() can resize / realloc() the mbuf? (not that it matters for
this purpose)
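
On the resize question: per mbuf(9), m_pullup() may allocate mbufs and move data so that the first bytes become contiguous, and on failure it frees the chain and returns NULL. The following is a userspace toy model of that contract, not the kernel code. struct buf, BUF_CAP, chain_free and buf_pullup are invented names, and the toy always copies into the head buffer rather than allocating a new one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_CAP 64  /* toy capacity; real mbufs are sized differently */

/* Toy stand-in for an mbuf: a fixed buffer plus a link to the next one. */
struct buf {
    struct buf *next;
    size_t len;                 /* valid bytes in data[] */
    unsigned char data[BUF_CAP];
};

static void
chain_free(struct buf *b)
{
    while (b != NULL) {
        struct buf *n = b->next;

        free(b);
        b = n;
    }
}

/*
 * Make the first 'want' bytes of the chain contiguous in the head buffer
 * by copying them forward from later buffers. Like m_pullup(), it frees
 * the chain and returns NULL if that cannot be done.
 */
static struct buf *
buf_pullup(struct buf *head, size_t want)
{
    assert(head != NULL);
    if (want > BUF_CAP) {
        chain_free(head);
        return (NULL);
    }
    while (head->len < want) {
        struct buf *n = head->next;
        size_t take;

        if (n == NULL) {            /* chain is shorter than 'want' */
            chain_free(head);
            return (NULL);
        }
        take = want - head->len;
        if (take > n->len)
            take = n->len;
        memcpy(head->data + head->len, n->data, take);
        head->len += take;
        memmove(n->data, n->data + take, n->len - take);
        n->len -= take;
        if (n->len == 0) {          /* drained: unlink and free it */
            head->next = n->next;
            free(n);
        }
    }
    return (head);
}
```

Because the chain can be freed on failure, the caller must always continue with the returned pointer, which is why the usual idiom assigns the result back to m.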
Is this for the loopback workload? If so, remember that there may be
some other things going on:
Both loopback and physical.
- Every packet is processed at least two times: once when sent, and then
  again when it's received.
- A TCP segment will need to be ACK'd, so if you're sending data in chunks
  in one direction, the ACKs will not be piggy-backed on existing data
  transfers, and instead be sent independently, hitting the network stack
  two more times.
No combination of these can make an accounting difference between 1,000
and 250,000 pps. I must be hitting something very bad here.
- Remember that TCP works to expand its window, and then maintains the
  highest performance it can by bumping up against the top of available
  bandwidth continuously. This involves detecting buffer limits by
  generating packets that can't be sent, adding to the packet count. With
  loopback traffic, the drop point occurs when you exceed the size of the
  netisr's queue for IP, so you might try bumping that from the default to
  something much larger.
My messages are approx. 100 +/- 10 bytes. No practical way they will
even span multiple mbufs. TCP_NODELAY is on.
No. x++ is massively slow if executed in parallel across many cores on
a variable in a single cache line. See my recent commit to kern_tc.c
for an example: the updating of trivial statistics for the kernel time
calls reduced 30m syscalls/second to 3m syscalls/second due to heavy
contention on the cache line holding the statistic. One of my goals for
I don't get it:
http://svn.freebsd.org/viewvc/base/stable/7/sys/kern/kern_tc.c?r1=189891&r2=189890&pathrev=189891
you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the
timecounters actually needed somewhere?
8.0 is to fix this problem for IP and TCP layers, and ideally also ifnet
but we'll see. We should be maintaining those stats per-CPU and then
aggregating to report them to userspace. This is what we already do for
a number of system stats -- UMA and kernel malloc, syscall and trap
counters, etc.
How magic is this? Is it just a matter of declaring mystatarray[NCPU]
and updating mystatarray[current_cpu], or (probably) should the spacing
between array elements be fixed so that two elements don't share a
cache line?
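
A minimal userspace sketch of the layout being asked about, assuming a 64-byte cache line and a fixed number of slots. CACHE_LINE, NSLOTS and the pkt_count_* names are illustrative and not FreeBSD's. In the kernel the slot would be the current CPU; here the caller passes it in.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; varies by CPU */
#define NSLOTS     8    /* stand-in for the number of CPUs */

/* One counter per slot, aligned so no two slots share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) uint64_t value;
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
    "each slot should occupy exactly one cache line");

static struct padded_counter pkt_count[NSLOTS];

/* Writer side: touches only the caller's own slot, so no shared line. */
static void
pkt_count_inc(unsigned slot)
{
    assert(slot < NSLOTS);
    pkt_count[slot].value++;
}

/* Reader side: aggregate on demand, e.g. when reporting to userspace. */
static uint64_t
pkt_count_total(void)
{
    uint64_t sum = 0;

    for (unsigned i = 0; i < NSLOTS; i++)
        sum += pkt_count[i].value;
    return (sum);
}
```

The total can be slightly out of date while writers are running, which is usually acceptable for statistics.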
- Use cpuset to pin ithreads, the netisr, and whatever else, to specific
  cores so that they don't migrate, and if your system uses HTT, experiment
  with pinning the ithread and the netisr on different threads on the same
  core, or at least, different cores on the same die.
I'm using em hardware; I still think there's a possibility I'm
fighting the driver in some cases but this has priority #2.
Have you tried LOCK_PROFILING? It would quickly tell you if driver
locks were a source of significant contention. It works quite well...
I don't think I'm fighting against locking artifacts, it looks more like
some kind of overly smart hardware thing, like interrupt moderation (but
not exactly interrupt moderation since the number of IRQs/s remains
approx. the same).
- If your card supports RSS, pass the flowid up the stack in the mbuf
  packet header flowid field, and use that instead of the hash for work
  placement.
Don't know about em. Don't really want to touch it if I don't have to :)
if_em doesn't support it, but if_igb does. If this saves you a minimum
of one and possibly two cache misses per packet, it could be a huge
performance improvement.
If I had the funds to upgrade hardware, I wouldn't be so interested in
solving it in software :)