Re: Performance Intel Pro 1000 MT (PWLA8490MT)

From: Bruce Evans (bde_at_zeta.org.au)
Date: 04/20/05

  • Next message: Claus Guttesen: "Re: some simple nfs-benchmarks on 5.4 RC2"
    Date: Wed, 20 Apr 2005 13:19:44 +1000 (EST)
    To: Bosko Milekic <bmilekic@technokratis.com>
    
    

    On Tue, 19 Apr 2005, Bosko Milekic wrote:

    > My experience with 6.0-CURRENT has been that I am able to push at
    > least about 400kpps INTO THE KERNEL from a gigE em card on its own
    > 64-bit PCI-X 133MHz bus (i.e., the bus is uncontested) and that's

    A 64-bit bus doesn't seem to be essential for reasonable performance.

    I get about 210 kpps (receive) for a bge card on an old Athlon system
    with a 32-bit PCI 33MHz bus. Overclocking this bus speeds up at least
    sending almost proportionally to the overclocking :-). This is with
    my version of an old version of -current, with no mpsafenet, no driver
    tuning, and no mistuning (no INVARIANTS, etc., no POLLING, no HZ > 100).
    Sending goes slightly slower (about 200 kpps).

    I get about 220 kpps (send) for a much-maligned (last year) sk non-card
    on a much-maligned newer Athlon nForce2 system with a 32-bit
    PCI 33MHz bus. This is with a similar setup but with sending in the
    driver changed to not use the braindamaged sk interrupt moderation.
    The changes don't improve the throughput significantly since it is
    limited by the sk or bus to 4 us per packet, but they reduce interrupt
    overhead.

    > basically out of the box GENERIC on a dual-CPU box with HTT disabled
    > and no debugging options, with small 50-60 byte UDP packets.

    I used an old version of ttcp for testing. A small packet for me is
    5 bytes UDP data since that is the minimum that ttcp will send, but
    I repeated the tests with a packet size of 50 for comparison. For
    the sk, the throughput with a packet size of 5 is only slightly larger
    (240 kpps).

    There are some kernel deficiencies which at best break testing using
    simple programs like ttcp and at worst reduce throughput:
    - when the tx queue fills up, the application should stop sending, at
       least in the udp case, but there is no way for userland to tell
       when the queue becomes non-full so that it is useful to try to add
       to it -- select() doesn't work for this. Applications either have
       to waste cycles by retrying immediately or waste send slots by
       retrying after a short sleep.

       The old version of ttcp that I use uses the latter method, with a
       sleep interval of 1000 usec. This works poorly, especially with HZ
       = 100 (which gives an actual sleep interval of 10000 to 20000 usec),
       or with devices that have a smaller tx queue than sk (511). The tx
       queue always fills up when blasted with packets; it becomes non-full
       a few usec later after a tx interrupt, and it becomes empty a few
       usec or msec later, and then the transmitter is idle while ttcp
       sleeps. With sk and HZ = 100, throughput is reduced to approximately
       511 * (1000000 / 15000) = 34066 pps. HZ = 1000 is just large enough
       for the sleep to always be shorter than the tx draining time (2/HZ
       seconds = 2 msec < 4 * 511 usec = 2.044 msec), so transmission can
       stream.

       Newer versions of ttcp like the one in ports are aware of this problem
       but can't fix it since it is in the kernel. tools/netrate is less
       explicitly aware of this problem and can't fix it... However, if
       you don't care about using the sender for anything else and don't
       want to measure efficiency of sending, then retrying immediately can
       be used to generate almost the maximum pps. Parts of netrate do this.

    - the tx queue length is too small for all drivers, so the tx queue fills
       up too often. It defaults to IFQ_MAXLEN = 50. This may be right for
       1 Mbps ethernet or even for 10 Mbps ethernet, but it is too small for
       100 Mbps ethernet and far too small for 1000 Mbps ethernet. Drivers
       with a larger hardware tx queue length all bump it up to their tx
       queue length (often, bogusly, less 1), but it needs to be larger for
       transmission to stream. I use (SK_TX_RING_CNT + imax(2*tick, 10000) / 4)
       for sk.
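
    The arithmetic in the two items above can be sketched roughly as
    follows. This is only an illustrative model, not code from ttcp or
    the kernel: the function names are made up, the 511-entry queue and
    4 usec per packet are the sk figures quoted in this message, 15000
    usec is the midpoint of the 10000 to 20000 usec sleep range, and
    SK_TX_RING_CNT = 512 and tick = 1000000 / HZ usec are assumptions.

```python
def sleep_retry_pps(queue_len, sleep_usec):
    """Packets/sec when the sender refills a full tx queue once per
    sleep and the queue drains completely during each sleep."""
    return queue_len * 1_000_000 / sleep_usec


def can_stream(queue_len, usec_per_packet, hz):
    """True if the worst-case sleep (2 ticks) is shorter than the time
    the hardware needs to drain a full tx queue."""
    worst_sleep_usec = 2 * 1_000_000 / hz
    drain_usec = queue_len * usec_per_packet
    return worst_sleep_usec < drain_usec


def sk_ifq_maxlen(tx_ring_cnt, tick_usec):
    """Mirror of (SK_TX_RING_CNT + imax(2*tick, 10000) / 4), with the
    division binding tighter than the addition as in C."""
    return tx_ring_cnt + max(2 * tick_usec, 10000) // 4


if __name__ == "__main__":
    print(int(sleep_retry_pps(511, 15000)))   # HZ = 100 case
    print(can_stream(511, 4, 1000))           # HZ = 1000
    print(can_stream(511, 4, 100))            # HZ = 100
    print(sk_ifq_maxlen(512, 10000))          # assumed ring size, HZ = 100
```

    With these inputs the model gives 34066 pps for the sleeping sender,
    says that HZ = 1000 can just stream while HZ = 100 cannot, and gives
    a queue length of 5512 for HZ = 100 under the assumed ring size.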

    > My tests were done without polling so with very high interrupt load
    > and that also sucks when you have a high-traffic scenario.

    Interrupt load isn't necessarily very high, relevant or reduced by
    polling. For transmission, with non-broken hardware and software,
    there should be not many more than (pps / <size of hardware tx queue>)
    tx interrupts per second, and <size of hardware tx queue> should be
    large enough that there aren't many txintrs/sec. For sk, this gives
    240000 / 511 = 470. After reprogramming sk's interrupt handling, I get 539.
    The standard driver used to get 7000+ with the old interrupt moderation
    timeout of 200 usec (actually 137 usec for Yukon, 200 for Genesis),
    and now 14000+ with an interrupt moderation timeout of 100 (68.5)
    usec. The interrupt load for 539 txintrs/sec and 240 kpps is 10% on an
    AthlonXP2600 (Barton) overclocked. Very little of this is related to
    interrupts, so the term "interrupt load" is misleading. About 480
    packets are handled for every tx interrupt (512 less 32 for watermark
    stuff). Much more than 90% of the handling is useful work and would
    have to be done somewhere; it just happens to be done in the interrupt
    handler, and that is the best place to do it. With polling, it would
    take longer to do it and the load is poorly reported so it is hard to see.
    The system load for 539 txintrs/sec and 240 kpps is much larger. It
    is about 45% (up from 25% in RELENG_4 :-().
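
    A rough way to check the interrupt rates above, as a sketch only.
    The function names are invented for illustration, and the second
    function assumes that a moderation timer fires at most once per
    timeout while the transmitter is kept busy.

```python
def tx_intrs_per_sec(pps, packets_per_intr):
    """Interrupt rate if one tx interrupt retires packets_per_intr packets."""
    return pps / packets_per_intr


def moderated_intrs_per_sec(timeout_usec):
    """Upper bound on the interrupt rate when interrupt moderation
    delivers one interrupt per timeout period."""
    return 1_000_000 / timeout_usec


if __name__ == "__main__":
    print(round(tx_intrs_per_sec(240000, 511)))   # full hardware queue
    print(round(tx_intrs_per_sec(240000, 480)))   # 512 less 32 for watermark
    print(round(moderated_intrs_per_sec(137)))    # Yukon, old timeout
    print(round(moderated_intrs_per_sec(68.5)))   # Yukon, new timeout
```

    This gives about 470 and 500 interrupts/sec for the queue-sized
    batches, and about 7300 and 14600 for the two moderation timeouts,
    which is in line with the 7000+ and 14000+ figures above.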

    [Context almost lost to top posting.]

    >>>> On 4/19/2005 1:32 PM, Eivind Hestnes wrote:
    >>>>
    >>>>> I have an Intel Pro 1000 MT (PWLA8490MT) NIC (em(4) driver 1.7.35)
    >>>>> installed
    >>>>> in a Pentium III 500 Mhz with 512 MB RAM (100 Mhz) running FreeBSD
    >>>>> 5.4-RC3.
    >>>>> The machine is routing traffic between multiple VLANs. Recently I did a
    >>>>> benchmark with/without device polling enabled. Without device
    >>>>> polling I was
    >>>>> able to transfer roughly 180 Mbit/s. The router however was
    >>>>> suffering when
    >>>>> doing this benchmark. Interrupt load was peaking 100% - overall the
    >>>>> system
    >>>>> itself was quite unusable (_very_ high system load).

    I think it is CPU-bound. My Athlon2600 (overclocked) is many times
    faster than your P3/500 (5-10 times?), but it doesn't have much CPU
    left over (sending 240000 5-byte udp packets per second from sk takes
    60% of the CPU, and sending 53000 1500-byte udp packets per second
    takes 30% of the CPU; sending tcp packets takes less CPU but goes
    slower). Apparently 2 or 3 P3/500's worth of CPU is needed just to
    keep up with the transmitter (with 100% of the CPU used but no
    transmission slots missed). RELENG_4 has lower overheads so it might
    need only 1 or 2 P3/500's worth of CPU to keep up.
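
    To put the CPU figures above on a per-packet basis, here is a small
    sketch. The function names are made up, and it assumes that the
    quoted CPU percentages are spent entirely on sending.

```python
def cpu_usec_per_packet(cpu_fraction, pps):
    """CPU time per packet in usec, given the fraction of one CPU used."""
    return cpu_fraction * 1_000_000 / pps


def payload_mbps(pps, payload_bytes):
    """UDP payload rate in Mbit/s (headers and framing not counted)."""
    return pps * payload_bytes * 8 / 1_000_000


if __name__ == "__main__":
    print(cpu_usec_per_packet(0.60, 240000))  # 5-byte packets
    print(cpu_usec_per_packet(0.30, 53000))   # 1500-byte packets
    print(payload_mbps(53000, 1500))
```

    On these numbers a small packet costs about 2.5 usec of CPU and a
    1500-byte packet about 5.7 usec, and 53000 packets of 1500 bytes
    carry about 636 Mbit/s of payload.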

    >>>>> With device
    >>>>> polling
    >>>>> enabled the interrupt kept stable around 40-50% and max transfer
    >>>>> rate was
    >>>>> nearly 70 Mbit/s. Not very scientific tests, but it gave me a pin
    >>>>> point.

    I don't believe in device polling. It's not surprising that it reduces
    throughput for a device that has large enough hardware queues. It just
    lets a machine that is too slow to handle 1Gbps ethernet (at least under
    FreeBSD) sort of work by not using the hardware to its full potential.
    70 Mbit/s is still bad -- it's easy to get more than that with a 100Mbps
    NIC.

    >>>>> eivind@core-gw:~$ sysctl -a | grep kern.polling
    >>>>> ...
    >>>>> kern.polling.idle_poll: 0

    Setting this should increase throughput when the system is idle by taking
    100% of the CPU then. With just polling every 1 msec (from HZ = 1000),
    there are the same problems as with ttcp retrying every 10-20 msec, but
    scaled down by a factor of 10-20. For my ttcp example, the transmitter
    runs dry every 2.044 msec so the polling interval must be shorter than
    2.044 msec, but this is with a full hardware tx queue (511 entries) on
    a not very fast NIC. If the hardware is just twice as fast or the tx
    queue is just half as large or half as full, then the hardware tx queue
    will run dry when polled every 1 msec and hardware capability will be
    wasted. This problem can be reduced by increasing HZ some more, but I
    don't believe in increasing it beyond 100, since only software that
    does too much polling would notice it being larger.
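
    The polling argument can be put into a small model. This is only a
    sketch with invented names. It assumes that every poll refills the
    hardware tx queue to the stated number of packets instantly and
    that the hardware then drains it at a fixed time per packet.

```python
def drain_usec(queued_packets, usec_per_packet):
    """Time for the hardware to empty the tx queue after a refill."""
    return queued_packets * usec_per_packet


def idle_fraction(queued_packets, usec_per_packet, poll_interval_usec):
    """Fraction of each polling interval during which the transmitter
    has nothing to send (0.0 means it never runs dry)."""
    drain = drain_usec(queued_packets, usec_per_packet)
    return max(0.0, 1.0 - drain / poll_interval_usec)


if __name__ == "__main__":
    print(drain_usec(511, 4))            # full sk queue, 4 usec per packet
    print(drain_usec(511, 2))            # hardware twice as fast
    print(idle_fraction(511, 4, 1000))   # polled every 1 msec
    print(idle_fraction(128, 4, 1000))   # queue only a quarter full
    print(idle_fraction(511, 4, 15000))  # ttcp-style 15 msec retry
```

    In this idealized model a full queue at 4 usec per packet lasts
    2044 usec and never runs dry with 1 msec polling. Halving either
    the queue contents or the per-packet time leaves about 1022 usec,
    which is practically no slack against a 1 msec poll, and a quarter
    full queue leaves the transmitter idle almost half of the time.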

    Bruce
    _______________________________________________
    freebsd-performance@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"

