4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout (gnagelhout_at_sandvine.com)
Date: 05/04/04

    To: freebsd-current@FreeBSD.org
    Date: Tue, 4 May 2004 15:55:32 -0400 
    
    

    Hi,

    For one of our applications in our testlab, we are running bridge(4)
    with several userland applications. I have found that the bridging
    performance (64-byte packets, 2-port bridge) on 5.2.1 is
    significantly lower than that of RELENG_4, especially when running in
    SMP. The platform is a dual 2.8GHz Xeon with a dual-port em (100MHz
    PCI-X). Invariants are disabled, and polling (with idle_polling
    enabled) is used.

    Here are the various test results (packets per second, full duplex)
    [traffic generator] <=> [FreeBSD bridge] <=> [traffic generator]

            4.7 UP:    1.2Mpps
            4.7 SMP:   1.2Mpps
            5.2.1 UP:  850Kpps
            5.2.1 SMP: 500Kpps

    I believe that for RELENG_4, the hardware is the bottleneck, which
    explains why there is no difference between UP and SMP.
    To get these numbers for 5.2.1, I had to make a small change
    to bridge.c (change ETHER_ADDR_EQ to BDG_MATCH in bridge_in to avoid
    calling bcmp). This change boosted performance by about 20%.
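
The bridge.c change described above swaps a general-purpose byte comparison for inline word comparisons. The following is a minimal user-space sketch of that idea, not the kernel's actual macros: the function names `addr_eq_memcmp` and `addr_eq_words` are hypothetical, and `memcmp` stands in for the kernel's `bcmp`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6	/* bytes in an Ethernet MAC address */

/* Generic comparison: one library call per address pair. */
static int
addr_eq_memcmp(const uint8_t *a, const uint8_t *b)
{
	assert(a != NULL && b != NULL);
	return memcmp(a, b, ETHER_ADDR_LEN) == 0;
}

/*
 * Inline comparison: one 32-bit and one 16-bit compare.
 * The words are loaded with fixed-size memcpy so the sketch stays
 * valid for unaligned input; compilers normally reduce these copies
 * to plain loads.
 */
static inline int
addr_eq_words(const uint8_t *a, const uint8_t *b)
{
	uint32_t a_first4, b_first4;
	uint16_t a_last2, b_last2;

	assert(a != NULL && b != NULL);
	memcpy(&a_first4, a, sizeof(a_first4));
	memcpy(&b_first4, b, sizeof(b_first4));
	memcpy(&a_last2, a + 4, sizeof(a_last2));
	memcpy(&b_last2, b + 4, sizeof(b_last2));
	return a_first4 == b_first4 && a_last2 == b_last2;
}
```

Both functions return the same answer for any pair of 6-byte addresses; the second simply avoids a function call in a path that runs once or more per packet.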

    I ran the kernel profiler for both UP and SMP (5.2.1), and included
    the results of the top functions below. In the past, I have run the
    profiler against RELENG_4 also, and the main difference with that
    (explaining reduced UP performance) is more overhead due to bus_dma &
    mbuf handling. When I compare the results of UP & SMP (5.2.1), all
    the functions using mutexes seem to get much more expensive, and
    critical_exit is taking more cycles. A quick count of mutexes in the
    bridge code path showed that there were 10-20 locks & unlocks for
    each packet. As a quick test, I added 10 more locks/unlocks to
    the code path, and the SMP performance went down to 330Kpps. This
    indicates that mutexes are much more expensive in SMP than in UP.
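
The lock experiment above can be turned into a rough per-lock cost. The sketch below assumes the forwarding path is CPU-bound and serialized, so that the average time per packet is simply the inverse of the packet rate; the helper names are hypothetical and the result is an estimate, not a measurement.

```c
#include <assert.h>

/* Average time spent per packet, in nanoseconds, at a given rate. */
static double
ns_per_packet(double packets_per_second)
{
	assert(packets_per_second > 0.0);
	return 1e9 / packets_per_second;
}

/*
 * Implied cost of one extra lock/unlock pair, in nanoseconds, given
 * the packet rate before and after adding extra_pairs pairs to the
 * per-packet code path.
 */
static double
ns_per_lock_pair(double pps_before, double pps_after, int extra_pairs)
{
	assert(extra_pairs > 0);
	return (ns_per_packet(pps_after) - ns_per_packet(pps_before)) /
	    extra_pairs;
}
```

With the SMP figures from this message, `ns_per_lock_pair(500e3, 330e3, 10)` comes to roughly 103 ns per lock/unlock pair, or somewhat under 300 cycles at 2.8GHz. Extrapolating that linearly to the 10-20 existing pairs would account for most of the 2000 ns per-packet budget at 500Kpps, so the figure is best read as an order-of-magnitude guide.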

    I would like to move to CURRENT for new hardware support, and the
    ability to properly use multi-threading in user-space, but can't do
    this until the performance bottlenecks are solved. I realize that
    5.x is still a work in progress and hasn't been tuned as well as 4.7
    yet, but are there any plans for optimizations in this area? Does
    anyone have any suggestions on what else I can try?

    Thanks,

    Gerrit

    (wheel)# sysctl net.link.ether.bridge
    net.link.ether.bridge.version: $Revision: 1.72 $ $Date: 2003/10/31 18:32:08 $
    net.link.ether.bridge.debug: 0
    net.link.ether.bridge.ipf: 0
    net.link.ether.bridge.ipfw: 0
    net.link.ether.bridge.copy: 0
    net.link.ether.bridge.ipfw_drop: 0
    net.link.ether.bridge.ipfw_collisions: 0
    net.link.ether.bridge.packets: 1299855421
    net.link.ether.bridge.dropped: 0
    net.link.ether.bridge.predict: 0
    net.link.ether.bridge.enable: 1
    net.link.ether.bridge.config: em0:1,em1:1

    (wheel)# sysctl kern.polling
    kern.polling.burst: 19
    kern.polling.each_burst: 80
    kern.polling.burst_max: 1000
    kern.polling.idle_poll: 1
    kern.polling.poll_in_trap: 0
    kern.polling.user_frac: 5
    kern.polling.reg_frac: 120
    kern.polling.short_ticks: 0
    kern.polling.lost_polls: 4297586
    kern.polling.pending_polls: 0
    kern.polling.residual_burst: 0
    kern.polling.handlers: 3
    kern.polling.enable: 1
    kern.polling.phase: 0
    kern.polling.suspect: 1030517
    kern.polling.stalled: 40
    kern.polling.idlepoll_sleeping: 0

    Here are some of the interesting parts of the config file:
    options HZ=2500
    options NMBCLUSTERS=32768
    #options GDB_REMOTE_CHAT
    #options INVARIANTS
    #options INVARIANT_SUPPORT
    #options DIAGNOSTIC

    options DEVICE_POLLING

    The following profiles show only the top functions (more than 0.2%):

    UP:

    granularity: each sample hit covers 16 byte(s) for 0.01% of 10.01 seconds
              
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     20.3      2.03      2.03                             ether_input [1]
     10.5      3.09      1.06                             mb_free [2]
      5.8      3.67      0.58                             _bus_dmamap_load_buffer [3]
      5.6      4.23      0.56                             m_getcl [4]
      5.3      4.76      0.53                             em_encap [5]
      5.1      5.27      0.51                             m_free [6]
      5.1      5.78      0.51                             mb_alloc [7]
      4.9      6.27      0.49                             bdg_forward [8]
      4.9      6.76      0.49                             em_process_receive_interrupts [9]
      4.1      7.17      0.41                             bridge_in [10]
      3.6      7.53      0.36                             generic_bcopy [11]
      3.6      7.89      0.36                             m_freem [12]
      2.6      8.14      0.26                             em_get_buf [13]
      2.2      8.37      0.22                             em_clean_transmit_interrupts [14]
      2.2      8.59      0.22                             em_start_locked [15]
      2.0      8.79      0.20                             bus_dmamap_load_mbuf [16]
      1.9      8.99      0.19                             bus_dmamap_load [17]
      1.3      9.11      0.13                             critical_exit [18]
      1.1      9.23      0.11                             em_start [19]
      1.0      9.32      0.10                             bus_dmamap_create [20]
      0.8      9.40      0.08                             em_receive_checksum [21]
      0.6      9.46      0.06                             em_tx_cb [22]
      0.5      9.52      0.05                             __mcount [23]
      0.5      9.57      0.05                             em_transmit_checksum_setup [24]
      0.5      9.62      0.05                             m_tag_delete_chain [25]
      0.5      9.66      0.05                             m_adj [26]
      0.3      9.69      0.03                             mb_pop_cont [27]
      0.2      9.71      0.02                             bus_dmamap_destroy [28]
      0.2      9.73      0.02                             mb_reclaim [29]
      0.2      9.75      0.02                             ether_ipfw_chk [30]
      0.2      9.77      0.02                             em_dmamap_cb [31]

    SMP:

    granularity: each sample hit covers 16 byte(s) for 0.00% of 20.14 seconds

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     47.9      9.64      9.64                             cpu_idle_default [1]
      4.9     10.63      0.99                             critical_exit [2]
      4.6     11.56      0.93                             mb_free [3]
      4.3     12.41      0.86                             bridge_in [4]
      4.2     13.26      0.84                             bdg_forward [5]
      4.1     14.08      0.82                             mb_alloc [6]
      3.9     14.87      0.79                             em_process_receive_interrupts [7]
      3.2     15.52      0.65                             em_start [8]
      3.1     16.15      0.63                             m_free [9]
      3.0     16.76      0.61                             _bus_dmamap_load_buffer [10]
      2.5     17.27      0.51                             m_getcl [11]
      2.1     17.69      0.42                             em_start_locked [12]
      1.9     18.07      0.37                             ether_input [13]
      1.5     18.38      0.31                             em_encap [14]
      1.1     18.61      0.23                             bus_dmamap_load [15]
      1.0     18.82      0.21                             generic_bcopy [16]
      0.9     19.00      0.18                             bus_dmamap_load_mbuf [17]
      0.8     19.16      0.17                             __mcount [18]
      0.6     19.29      0.13                             em_get_buf [19]
      0.6     19.41      0.12                             em_clean_transmit_interrupts [20]
      0.5     19.52      0.11                             em_receive_checksum [21]
      0.4     19.60      0.09                             m_gethdr_clrd [22]
      0.4     19.69      0.08                             bus_dmamap_create [23]
      0.3     19.75      0.06                             em_tx_cb [24]
      0.2     19.80      0.05                             m_freem [25]
      0.2     19.83      0.03                             m_adj [26]
      0.1     19.85      0.02                             m_tag_delete_chain [27]
      0.1     19.87      0.02                             bus_dmamap_destroy [28]
      0.1     19.89      0.02                             mb_pop_cont [29]
      0.1     19.91      0.02                             em_dmamap_cb [30]
      0.1     19.92      0.02                             em_transmit_checksum_setup [31]
      0.1     19.94      0.01                             mb_alloc_wait [32]
      0.1     19.95      0.01                             em_poll [33]
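
Because the two profiles cover different total times and different packet rates, the raw percentages are hard to compare directly. One way to put them on a common footing is self time per forwarded packet. The sketch below rests on two assumptions that the message does not confirm: that the packet rates reported earlier also held while profiling (profiling overhead probably lowered them), and that the SMP total of 20.14 seconds is summed over two CPUs. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Self time per forwarded packet, in nanoseconds.
 *   self_seconds:    "self seconds" column for one function
 *   profile_seconds: total seconds covered by the profile (all CPUs)
 *   ncpus:           CPUs contributing samples (assumed 1 for UP, 2 for SMP)
 *   pps:             packet rate assumed to hold during profiling
 */
static double
self_ns_per_packet(double self_seconds, double profile_seconds, int ncpus,
    double pps)
{
	double wall_seconds;

	assert(ncpus > 0 && profile_seconds > 0.0 && pps > 0.0);
	wall_seconds = profile_seconds / ncpus;	/* elapsed time of the run */
	return self_seconds * 1e9 / (wall_seconds * pps);
}
```

Under those assumptions, critical_exit costs about 15 ns per packet in the UP profile (`self_ns_per_packet(0.13, 10.01, 1, 850e3)`) and about 197 ns per packet in the SMP profile (`self_ns_per_packet(0.99, 20.14, 2, 500e3)`), while mb_free goes from about 125 ns to about 185 ns.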


