Re: Changes in the network interface queueing handoff model



On Sunday 30 July 2006 16:04, Robert Watson wrote:
One of the ideas that I, Scott Long, and a few others have been bouncing
around for some time is a restructuring of the network interface packet
transmission API to reduce the number of locking operations and allow
network device drivers increased control of the queueing behavior. Right
now, it works something like the following:

- When a network protocol wants to transmit, it calls the ifnet's link
layer output routine via ifp->if_output() with the ifnet pointer, packet,
destination address information, and route information.

- The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
the packet as necessary, performs a link layer address translation (such
as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(),
which accepts the ifnet pointer and packet.

- The ifnet layer enqueues the packet in the ifnet send queue
(ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to
determine if it needs to "start" output by the driver. If the driver is
already active, it doesn't, and otherwise, it does.

- The driver dequeues the packet from ifp->if_snd, performs any driver
encapsulation and wrapping, and notifies the hardware. In modern
hardware, this consists of hooking the data of the packet up to the
descriptor ring and notifying the hardware to pick it up via DMA. In older
hardware, the driver would perform a series of I/O operations to send the
entire packet directly to the card via a system bus.
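
The enqueue-then-start flow described above can be sketched as a small
user-space model. This is only an illustration under stated assumptions:
every sim_* name is hypothetical, packets are plain integers rather than
mbufs, and locking is left out so that the control flow stays visible. It is
not the kernel's IFQ_HANDOFF() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model: these are NOT the kernel's structures. */
#define SIM_QLEN 8

struct sim_ifnet {
    int    snd[SIM_QLEN];  /* stands in for ifp->if_snd; holds packet ids */
    size_t head, count;    /* ring-buffer bookkeeping for snd */
    int    oactive;        /* stands in for IFF_DRV_OACTIVE */
    int    start_calls;    /* times the driver start routine ran */
    int    sent;           /* packets passed on to the "hardware" */
};

/* ifnet layer: append the packet to the interface send queue. */
static int sim_enqueue(struct sim_ifnet *ifp, int pkt)
{
    if (ifp->count == SIM_QLEN)
        return -1;                      /* queue full: caller drops */
    ifp->snd[(ifp->head + ifp->count) % SIM_QLEN] = pkt;
    ifp->count++;
    return 0;
}

/* Driver: drain the send queue and pass each packet to the hardware. */
static void sim_if_start(struct sim_ifnet *ifp)
{
    ifp->start_calls++;
    while (ifp->count > 0) {
        ifp->head = (ifp->head + 1) % SIM_QLEN;
        ifp->count--;
        ifp->sent++;
    }
}

/* Link layer handoff: enqueue, then start the driver only if it is idle. */
static int sim_handoff(struct sim_ifnet *ifp, int pkt)
{
    assert(ifp != NULL);
    if (sim_enqueue(ifp, pkt) != 0)
        return -1;
    if (!ifp->oactive)                  /* driver idle: kick it */
        sim_if_start(ifp);
    return 0;
}
```

With oactive clear, each handoff runs the start routine and the queue drains
at once; with oactive set, the packet simply waits in the queue for the
driver to pick it up later.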

Why change this? A few reasons:

- The ifnet layer send queue is becoming decreasingly useful over time.
Most modern hardware has a significant number of slots in its transmit
descriptor ring, tuned for the performance of the hardware, etc, which is
the effective transmit queue in practice. The additional queue depth
doesn't increase throughput substantially (if at all) but does consume
memory.

- On extremely fast hardware (with respect to CPU speed), the queue remains
essentially empty, so we pay the cost of enqueueing and dequeuing a
packet from an empty queue.

- The ifnet send queue is a separately locked object from the device
driver, meaning that for a single enqueue/dequeue pair, we pay an extra
four lock operations (two for insert, two for remove) per packet.

- For synthetic link layer drivers, such as if_vlan, which have no need for
queueing at all, the cost of queueing is eliminated.

- IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
driver, which helps eliminate a latent race condition involving use of
the flag.

The proposed change is simple: right now one or more enqueue operations
occur, then a call to ifp->if_start() is made to notify the driver that it
may need to do something (if the IFF_DRV_OACTIVE flag isn't set). In the new world
order, the driver is directly passed the mbuf, and may then choose to queue
it or otherwise handle it as it sees fit. The immediate practical benefit
is clear: if the queueing at the ifnet layer is unnecessary, it is entirely
avoided, skipping enqueue, dequeue, and four mutex operations. This
applies immediately for VLAN processing, but also means that for modern
gigabit cards, the hardware queue (which will be used anyway) is the only
queue necessary.
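
To make the lock accounting concrete, here is a minimal user-space sketch
that compares the two paths. All sim_* names are hypothetical, the lock
functions only count calls, and the queued path is simplified to one packet
per enqueue/dequeue pair. The startmbuf member is modelled on the proposed
if_startmbuf hook and is not the patch's actual signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; names are illustrative, not kernel API. */
struct sim_ifnet {
    int queued;      /* packets sitting in the modelled if_snd queue */
    int lock_ops;    /* simulated ifq mutex lock + unlock calls */
    int sent;        /* packets handed to the "hardware" */
    /* Optional direct-transmit hook, modelled on the proposed if_startmbuf. */
    int (*startmbuf)(struct sim_ifnet *, int pkt);
};

static void sim_ifq_lock(struct sim_ifnet *ifp)   { ifp->lock_ops++; }
static void sim_ifq_unlock(struct sim_ifnet *ifp) { ifp->lock_ops++; }

/* Old path: enqueue under the ifq lock, then the driver dequeues under it. */
static void sim_handoff_queued(struct sim_ifnet *ifp, int pkt)
{
    (void)pkt;
    sim_ifq_lock(ifp);  ifp->queued++;  sim_ifq_unlock(ifp);   /* enqueue */
    sim_ifq_lock(ifp);  ifp->queued--;  sim_ifq_unlock(ifp);   /* dequeue */
    ifp->sent++;
}

/* A driver that needs no software queue: transmit immediately. */
static int sim_direct_tx(struct sim_ifnet *ifp, int pkt)
{
    (void)pkt;
    ifp->sent++;
    return 0;
}

/* New path: pass the packet straight to the driver. */
static int sim_handoff_direct(struct sim_ifnet *ifp, int pkt)
{
    assert(ifp->startmbuf != NULL);
    return ifp->startmbuf(ifp, pkt);
}
```

In this model the queued path costs four ifq lock operations per packet and
the direct path costs none, which is the saving described above. Any locking
the driver does internally is not counted here.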

There are a few downsides, of course:

- For older hardware without its own queueing, the queue is still required
-- not only that, but we've now introduced an unconditional function
pointer invocation, which on older hardware has a more significant
relative cost than it has on more recent CPUs.

- If drivers still require or use a queue, they must now synchronize access
to the queue. The obvious choices are to use the ifq lock (and restore the
above four lock operations), or to use the driver mutex (and risk higher
contention). Right now, if the driver is busy (driver mutex held) then an
enqueue is still possible, but with this change and a single mutex
protecting the send queue and driver, that is no longer possible.

Attached is a patch that maintains the current if_start, but adds
if_startmbuf. If a device driver implements if_startmbuf and the global
sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the
driver will be used. Otherwise, if_start is used. I have modified the
if_em driver to implement if_startmbuf also. If there is no packet backlog
in the if_snd queue, it directly places the packet in the transmit
descriptor ring. If there is a backlog, it uses the if_snd queue protected
by driver mutex, rather than a separate ifq mutex.
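
The driver-side behaviour described for if_em can be modelled roughly as
follows. This is a user-space sketch, not the code in the attached patch:
the sim_* names, the ring and backlog sizes, and the transmit-complete
routine are all assumptions. The text does not say what happens when the
ring is full and there is no backlog; this sketch assumes the packet is
queued in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the described if_em behaviour. */
#define SIM_RING_SLOTS 2   /* tiny on purpose so a backlog is easy to force */
#define SIM_BACKLOG    8

struct sim_em {
    int ring_used;             /* occupied transmit descriptor slots */
    int backlog[SIM_BACKLOG];  /* stands in for if_snd, guarded by driver mutex */
    int backlog_len;
    int mtx_held;              /* models the single driver mutex */
};

static void sim_em_lock(struct sim_em *sc)   { assert(!sc->mtx_held); sc->mtx_held = 1; }
static void sim_em_unlock(struct sim_em *sc) { assert(sc->mtx_held);  sc->mtx_held = 0; }

/* Returns 0 if placed in the ring, 1 if queued as backlog, -1 on drop. */
static int sim_em_startmbuf(struct sim_em *sc, int pkt)
{
    int rc;

    sim_em_lock(sc);
    if (sc->backlog_len == 0 && sc->ring_used < SIM_RING_SLOTS) {
        sc->ring_used++;                       /* fast path: straight to ring */
        rc = 0;
    } else if (sc->backlog_len < SIM_BACKLOG) {
        sc->backlog[sc->backlog_len++] = pkt;  /* keep packets in order */
        rc = 1;
    } else {
        rc = -1;                               /* backlog full: drop */
    }
    sim_em_unlock(sc);
    return rc;
}

/* Assumed transmit-complete path: free one slot, refill it from the backlog. */
static void sim_em_txeof(struct sim_em *sc)
{
    sim_em_lock(sc);
    if (sc->ring_used > 0)
        sc->ring_used--;
    if (sc->backlog_len > 0 && sc->ring_used < SIM_RING_SLOTS) {
        for (int i = 1; i < sc->backlog_len; i++)  /* shift backlog down */
            sc->backlog[i - 1] = sc->backlog[i];
        sc->backlog_len--;
        sc->ring_used++;
    }
    sim_em_unlock(sc);
}
```

Because one mutex covers both the ring and the backlog, a sender has to wait
whenever the driver holds that mutex. This is the contention trade-off listed
among the downsides.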

In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte
payload PPS on UP, and a 10% improvement on SMP. I saw a 1.7% performance
improvement in the bulk serving of 1k files over HTTP. These are only
micro-benchmarks, and reflect a configuration in which the CPU is unable to
keep up with the output rate of the 1gbps ethernet card in the device, so
reductions in host CPU usage are immediately visible in increased output as
the CPU is able to better keep up with the network hardware. Other
configurations are also of interest, especially ones in
which the network device is unable to keep up with the CPU, resulting in
more queueing.

Conceptual review as well as benchmarking, etc, would be most welcome.

This begs the question: What about ALTQ?

If we maintain the fallback mechanism in _handoff, we can just add
ALTQ_IS_ENABLED() to the test. Otherwise every driver's startmbuf function
would have to take care of ALTQ itself, which is not preferable.
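
A rough sketch of what that test in the handoff could look like, again as a
user-space model. The sim_* names are hypothetical; altq_enabled stands in
for whatever ALTQ_IS_ENABLED() would report on the interface's send queue,
and sim_startmbuf_enabled stands in for the net.startmbuf_enabled sysctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; field and function names are illustrative. */
struct sim_ifnet {
    int altq_enabled;          /* models what ALTQ_IS_ENABLED() would report */
    int (*startmbuf)(struct sim_ifnet *, int pkt);  /* may be NULL */
    int direct_calls;          /* packets sent via the startmbuf path */
    int queued_calls;          /* packets sent via the queue + if_start path */
};

static int sim_startmbuf_enabled = 1;  /* models net.startmbuf_enabled */

/* Driver's direct-transmit hook in this model. */
static int sim_drv_startmbuf(struct sim_ifnet *ifp, int pkt)
{
    (void)pkt;
    ifp->direct_calls++;
    return 0;
}

/* Fallback: the existing enqueue + if_start path, where ALTQ can act. */
static int sim_queue_and_start(struct sim_ifnet *ifp, int pkt)
{
    (void)pkt;
    ifp->queued_calls++;
    return 0;
}

/* Handoff: take the direct path only when ALTQ is not active. */
static int sim_handoff(struct sim_ifnet *ifp, int pkt)
{
    assert(ifp != NULL);
    if (sim_startmbuf_enabled && ifp->startmbuf != NULL && !ifp->altq_enabled)
        return ifp->startmbuf(ifp, pkt);
    return sim_queue_and_start(ifp, pkt);
}
```

Keeping the check in the handoff means a driver's startmbuf routine never
sees a packet while ALTQ is active, so drivers need no ALTQ-specific code.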

I strongly agree with your comment about how messed up ifq_*/if_* in if_var.h
are - and I'm afraid that's partly my fault for bringing in ALTQ.

--
/"\ Best regards, | mlaier@xxxxxxxxxxx
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | mlaier@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News



