Re: em(4) receive part wedging randomly at moderate load

From: Benjamin Rosenblum (ben_at_benswebs.com)
Date: 09/26/05

  • Next message: Petri Helenius: "Re: em(4) receive part wedging randomly at moderate load"
    Date: Mon, 26 Sep 2005 11:08:53 -0400
    To: net@freebsd.org
    
    

    The em driver itself is extremely buggy. Many people, myself
    included, are hitting major problems with this driver that are
    causing serious issues. I can't transfer any large files to my
    server because the em driver panics and drops the connection for 15-20
    seconds. It's a real pain in the *** when this happens, because this
    is my primary network storage server. I have had to resort to the
    backup systems lately because of this problem. I think the entire em
    network driver needs to be reworked and all these bugs need to
    be taken care of, since this is one of the top three or so network cards
    used in the field today for gigabit transfer.

    Gleb Smirnoff wrote:

    > Colleagues,
    >
    > For the last month we have been experiencing a nasty problem with the em(4)
    >driver. Several times a day the receive path of the driver wedges
    >for a minute or two. During the wedge the transmit part works with
    >no problems. The latter fact makes this problem very nasty, because
    >the problematic router can't be backed up with the help of CARP.
    >
    >Some details: during the wedge all incoming packets are lost and
    >counted as "Missed packets". I've checked this using
    >`sysctl dev.em.0.stats=1`. The `dmesg` output is the following:
    >
    >em0: Excessive collisions = 0
    >em0: Symbol errors = 0
    >em0: Sequence errors = 0
    >em0: Defer count = 0
    >em0: Missed Packets = 1266
    >em0: Receive No Buffers = 220
    >em0: Receive length errors = 0
    >em0: Receive errors = 0
    >em0: Crc errors = 0
    >em0: Alignment errors = 0
    >em0: Carrier extension errors = 0
    >em0: XON Rcvd = 0
    >em0: XON Xmtd = 0
    >em0: XOFF Rcvd = 0
    >em0: XOFF Xmtd = 0
    >em0: Good Packets Rcvd = 28347789
    >em0: Good Packets Xmtd = 30911959
    >
    >There is clear evidence that the command `sysctl dev.em.0.stats=1` itself
    >can trigger the wedge. It is important that the stats are printed
    >to a 9600 baud serial console, and this takes about a second. I
    >suspect that the wedge happens when the kernel doesn't service NIC
    >interrupts for some period of time. Yes, some packets should be lost in
    >this case, but the wedge must not continue for minutes!
    >
    >The box is serving 8 - 15 kpps, 70 - 100 Mbps. It runs a stateful pf(4)
    >firewall, with 50k - 80k states. IP fastforwarding is enabled. The
    >average state insert/removal rate is 300 states per second, however
    >sometimes several thousand states can be removed in one pass. The
    >state removal locks the network code for quite a long time, so I guess
    >that the wedge happens exactly when a lot of states are removed. The NIC
    >interrupts aren't serviced for some time and it wedges.
    >
    >The hardware is a Supermicro server, with two onboard NICs:
    >
    >dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
    >dev.em.1.%pnpinfo: vendor=0x8086 device=0x1076 subvendor=0x8086 subdevice=0x1076 class=0x020000
    >
    >The NIC is plugged into a Cisco Catalyst 6509 gigabit Ethernet port. No
    >errors are counted on the switch port.
    >
    >To work around the problem, I have made the following patch:
    >
    >@@ -1650,12 +1651,18 @@
    > struct ifnet *ifp;
    > struct adapter * adapter = arg;
    > ifp = adapter->ifp;
    >+ uint64_t ompc;
    >
    > EM_LOCK(adapter);
    >
    > em_check_for_link(&adapter->hw);
    > em_print_link_status(adapter);
    >- em_update_stats_counters(adapter);
    >+ ompc = adapter->stats.mpc;
    >+ em_update_stats_counters(adapter);
    >+ if (adapter->stats.mpc > ompc) {
    >+ printf("em watchdog: mpc %ju->%ju\n", (uintmax_t)ompc, (uintmax_t)adapter->stats.mpc);
    >+ em_init_locked(adapter);
    >+ }
    > if (em_display_debug_stats && ifp->if_drv_flags & IFF_DRV_RUNNING) {
    > em_print_hw_stats(adapter);
    > }
    >
    >It helps to reduce downtime from a few minutes to 2 seconds, but this
    >is a very dirty approach to the problem. Sample prints during runtime
    >with the patch:
    >
    >em watchdog: mpc 1767->2739
    >em watchdog: mpc 2739->4724
    >em watchdog: mpc 4724->7794
    >em watchdog: mpc 7794->10729
    >
    >Every time this is printed, the network wedges for 2 seconds and then
    >it revives.
    >
    >I am asking developers who work at Intel to pay attention to this problem.
    >For my part, I can offer any help with testing and debugging.
    >
    >
    >


