Re: Some performance measurements on the FreeBSD network stack

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
On 19.04.2012 15:30, Luigi Rizzo wrote:
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.
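
For readers who have not used the early-return technique, here is a rough userspace sketch of the idea. It is not the actual netsend or kernel instrumentation: the stage names, stage_work(), send_path() and avg_ns_per_call() are all hypothetical stand-ins. The path returns right after a chosen stage, and the cost of one stage is estimated as the difference between the averages for two adjacent exit points.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stages of a send path; the names are illustrative only. */
enum { STAGE_SYSCALL, STAGE_ALLOC, STAGE_ROUTE, STAGE_SEND, STAGE_COUNT };

static volatile uint64_t sink;  /* keeps the compiler from removing the work */

/* Stand-in for the real work done by one stage. */
static void stage_work(int stage)
{
    for (int i = 0; i < 16 * (stage + 1); i++)
        sink += (uint64_t)i;
}

/* Run the path but return right after stage `exit_after`.
 * Returns the number of stages actually executed. */
static int send_path(int exit_after)
{
    int done = 0;

    assert(exit_after >= 0 && exit_after < STAGE_COUNT);
    for (int s = 0; s < STAGE_COUNT; s++) {
        stage_work(s);
        done++;
        if (s == exit_after)
            break;              /* the instrumented early return */
    }
    return done;
}

/* Average wall-clock nanoseconds per call for a given exit point.
 * timespec_get() is C11; a monotonic clock would be preferable
 * for real measurements. */
static double avg_ns_per_call(int exit_after, long iterations)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    for (long i = 0; i < iterations; i++)
        (void)send_path(exit_after);
    timespec_get(&t1, TIME_UTC);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iterations;
}
```

With this, avg_ns_per_call(STAGE_ROUTE, n) minus avg_ns_per_call(STAGE_ALLOC, n) approximates the cost of the route stage alone, provided n is large enough to average out noise.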

Jumping over very interesting analysis...

- the next expensive operation, consuming another 100ns,
is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
seems to scale decently at least with 4 cores. The copyin() is
relatively inexpensive (not reported in the data below, but
disabling it saves only 15-20ns for a short packet).

I have not followed the details, but the allocator calls the zone
allocator and there is at least one critical_enter()/critical_exit()
pair, and the highly modular architecture invokes long chains of
indirect function calls both on allocation and release.

It might make sense to keep a small pool of mbufs attached to the
socket buffer instead of going to the zone allocator.
Or defer the actual encapsulation to the
(*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.

The UMA mbuf allocator is certainly not perfect but rather good.
It has a per-CPU cache of mbuf's that are very fast to allocate
from. Once it has used them it needs to refill from the global
pool which may happen from time to time and show up in the averages.
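
To make the fast-path/slow-path split concrete, here is a minimal userspace sketch of a cache in front of a locked global pool. It is not the UMA code: it uses a per-thread cache as a stand-in for the per-CPU cache, a simple spinlock for the global pool, and every name in it (struct buf, CACHE_MAX, buf_alloc(), buf_free(), global_refills) is made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 64   /* hypothetical size of the global pool */
#define CACHE_MAX 8    /* hypothetical per-thread cache size */

struct buf {
    struct buf *next;
    char data[256];
};

/* Global pool: shared and lock-protected (the slow path). */
static struct buf pool_storage[POOL_SIZE];
static struct buf *global_free;
static atomic_flag global_lock = ATOMIC_FLAG_INIT;
static unsigned long global_refills;   /* how often the slow path ran */

/* Per-thread cache: touched without any lock (the fast path). */
static _Thread_local struct buf *cache_head;
static _Thread_local int cache_count;

static void pool_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&global_lock,
        memory_order_acquire))
        ;   /* spin */
}

static void pool_unlock(void)
{
    atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

/* Put every buffer on the global free list; call once at startup. */
static void pool_init(void)
{
    global_free = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        pool_storage[i].next = global_free;
        global_free = &pool_storage[i];
    }
}

/* Slow path: move up to CACHE_MAX buffers into the local cache. */
static void cache_refill(void)
{
    pool_lock();
    global_refills++;
    while (cache_count < CACHE_MAX && global_free != NULL) {
        struct buf *b = global_free;
        global_free = b->next;
        b->next = cache_head;
        cache_head = b;
        cache_count++;
    }
    pool_unlock();
}

/* Returns NULL only when both the cache and the global pool are empty. */
static struct buf *buf_alloc(void)
{
    if (cache_head == NULL)
        cache_refill();
    struct buf *b = cache_head;
    if (b == NULL)
        return NULL;
    cache_head = b->next;
    cache_count--;
    return b;
}

static void buf_free(struct buf *b)
{
    assert(b != NULL);
    if (cache_count < CACHE_MAX) {      /* fast path: keep it local */
        b->next = cache_head;
        cache_head = b;
        cache_count++;
        return;
    }
    pool_lock();                        /* cache full: return to the pool */
    b->next = global_free;
    global_free = b;
    pool_unlock();
}
```

In this sketch the lock is taken once per CACHE_MAX allocations at most, which is why an occasional refill can show up in the averages while the common case stays cheap.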

Indeed, I was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and for short times, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.
The allocation happens while the code already holds an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached

But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What I envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
attached to the socket, built on demand, and cached and managed
with similar invalidation rules as used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio
instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
where the code already has an x-lock on some resource (could be
the snd_buf, the interface, ...) so the allocation comes for free.
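
A rough sketch of the header-template part, under stated assumptions. This is plain userspace C, not kernel code. Invalidation is modelled with a single global generation counter (route_gen) that is bumped on any route or ARP change, which is my simplification and not necessarily what fastforward does. All names (struct flow, struct hdr_template, template_build(), packet_build()) are hypothetical. Checksums are left at zero to keep it short; a real IPv4 header needs a valid checksum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_LEN 14
#define IP_LEN  20   /* IPv4 header without options */
#define UDP_LEN 8
#define HDR_LEN (ETH_LEN + IP_LEN + UDP_LEN)

static unsigned route_gen;        /* hypothetical: bumped on route/ARP change */
static unsigned template_builds;  /* counts slow-path rebuilds */

struct flow {
    uint8_t  dst_mac[6], src_mac[6];
    uint8_t  src_ip[4], dst_ip[4];
    uint16_t sport, dport;        /* host byte order */
};

struct hdr_template {
    uint8_t  bytes[HDR_LEN];
    unsigned gen;                 /* route_gen value at build time */
    bool     valid;
};

/* Store a 16-bit value in network byte order. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Slow path: fill in every field that does not change per packet. */
static void template_build(struct hdr_template *t, const struct flow *f)
{
    uint8_t *ip = t->bytes + ETH_LEN;
    uint8_t *udp = ip + IP_LEN;

    memset(t->bytes, 0, HDR_LEN);
    memcpy(t->bytes, f->dst_mac, 6);
    memcpy(t->bytes + 6, f->src_mac, 6);
    put16(t->bytes + 12, 0x0800);     /* ethertype: IPv4 */
    ip[0] = 0x45;                     /* version 4, header length 5 words */
    ip[8] = 64;                       /* TTL */
    ip[9] = 17;                       /* protocol: UDP */
    memcpy(ip + 12, f->src_ip, 4);
    memcpy(ip + 16, f->dst_ip, 4);
    put16(udp, f->sport);
    put16(udp + 2, f->dport);
    t->gen = route_gen;
    t->valid = true;
    template_builds++;
}

/* Fast path: copy the cached headers and patch the two length fields.
 * Returns the packet length, or 0 if the payload does not fit. */
static size_t packet_build(struct hdr_template *t, const struct flow *f,
    uint8_t *out, size_t cap, const void *payload, size_t len)
{
    assert(t != NULL && f != NULL && out != NULL);
    if (len > (size_t)(65535 - IP_LEN - UDP_LEN) || cap < HDR_LEN + len)
        return 0;
    if (!t->valid || t->gen != route_gen)
        template_build(t, f);         /* built on demand, then cached */
    memcpy(out, t->bytes, HDR_LEN);
    put16(out + ETH_LEN + 2, (uint16_t)(IP_LEN + UDP_LEN + len));
    put16(out + ETH_LEN + IP_LEN + 4, (uint16_t)(UDP_LEN + len));
    if (len > 0)
        memcpy(out + HDR_LEN, payload, len);
    return HDR_LEN + len;
}
```

The template itself is only read on the fast path, so it could be shared read-only; the per-packet work reduces to one memcpy plus two 16-bit stores (and the checksums this sketch omits).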

- another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only does it eat another
100ns+ on an empty routing table, but it also
causes heavy contention when multiple cores
are involved.

This is indeed a big problem. I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which

I was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets?
