Re: 'nfe' stalls (analysis and partial solution)
- From: Pyun YongHyeon <pyunyh@xxxxxxxxx>
- Date: Sat, 26 Apr 2008 14:33:37 +0900
On Fri, Apr 25, 2008 at 06:00:39PM +0200, Luigi Rizzo wrote:
just for the record and the mail archives - i have been experiencing
a lot of unrecovered stalls of the network card with the 'nfe'
driver under heavy load (this was on 7.0-i386 and 7.0-amd64, but
it is hardware related so it is cross-platform).
After 2-3 days of investigation, and with the help of
Pyun YongHyeon (yongari) i finally managed to pin down the
problem and start working on a solution.
I would be grateful if others can report similar problems
with the 'nfe' driver so we can see if the patch we come
up with also fixes their problem.
THE PROBLEM:
under heavy load (e.g. full speed ssh transfers, disk activity,
Xwindows...) causing the receive ring to fill up, it seems that
some nfe-supported cards (at least the MCP67) enter a state where
they stop looking at the ring buffers and drop incoming packets.
The driver does not recover from the error, so you have to
manually 'ifconfig down; ifconfig up' the interface to restart
receiving.
I tried to reproduce this on CK804 MCP9 hardware but nfe(4)
recovered successfully from this Rx ring full condition.
Of course, I still don't know how to reliably reproduce Rx stalls,
but the Rx ring full condition alone doesn't seem to trigger Rx
stalls on CK804 MCP9.
As Luigi said, it's also possible that only some NVIDIA chips have
this issue. If you happen to see this issue, please let us know what
chip/model you have.
The Rx ring full condition can easily be triggered by sending
lots of UDP packets with network benchmark programs. Running
buildworld while the benchmark test is in progress increases the
load further and would certainly trigger the condition.
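
For anyone without a benchmark program at hand, the UDP flood described
above can be approximated with a small sender like the one below. This is
only a rough sketch, not a replacement for a real benchmark tool: the
function name send_udp_burst(), its parameters and the MAX_PAYLOAD
constant are made up for illustration, and it assumes IPv4 with a
1500-byte MTU.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: largest UDP payload that fits a 1500-byte MTU over IPv4. */
#define MAX_PAYLOAD 1472

/*
 * Hypothetical helper: send `count` UDP datagrams of `payload_len` bytes
 * to ip:port as fast as the socket accepts them.  Returns the number of
 * datagrams handed to the kernel, or -1 on a setup error.
 */
long
send_udp_burst(const char *ip, uint16_t port, size_t payload_len, long count)
{
	char buf[MAX_PAYLOAD];
	struct sockaddr_in dst;
	long i, sent = 0;
	int s;

	if (payload_len > sizeof(buf) || count < 0)
		return (-1);

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
		return (-1);	/* not a valid dotted-quad address */

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return (-1);

	memset(buf, 0xa5, sizeof(buf));	/* arbitrary filler pattern */
	for (i = 0; i < count; i++) {
		/* Count only datagrams the kernel accepted in full. */
		if (sendto(s, buf, payload_len, 0,
		    (struct sockaddr *)&dst, sizeof(dst)) ==
		    (ssize_t)payload_len)
			sent++;
	}
	close(s);
	return (sent);
}
```

Calling it from a second machine, with the address of the nfe(4) host
and a large count, while buildworld runs on the receiver should give a
load roughly comparable to the one described above.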
SOLUTION:
I have not yet determined the exact conditions causing the error,
so as a temporary workaround i am calling nfe_init_locked()
from the watchdog routine every time a receive error of some kind
is experienced.
I definitely need to apply stricter checks on the error condition,
but a few extra card resets are certainly better than losing contact
with the machine. Unfortunately there is no documentation on this
behaviour of the card, and the Linux driver (forcedeth) has no
error checking/recovery at all so it is of no help.
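
The control flow of the workaround can be modelled in userland roughly
as follows. This is not the actual patch and not driver code: struct
model_softc and the model_*() functions are hypothetical stand-ins, and
model_init_locked() only counts resets where the real driver would call
nfe_init_locked().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct model_softc {
	bool	 rx_error_pending;	/* set when any receive error is seen */
	unsigned reinit_count;		/* how often the card was reset */
};

/* Stand-in for nfe_init_locked(): here it only counts the resets. */
static void
model_init_locked(struct model_softc *sc)
{
	sc->reinit_count++;
}

/* Called from the (modelled) receive path when an error is noticed. */
void
model_note_rx_error(struct model_softc *sc)
{
	assert(sc != NULL);
	sc->rx_error_pending = true;
}

/*
 * Periodic watchdog tick: if a receive error was recorded since the
 * last tick, clear the flag and reinitialize once.  Returns true if a
 * reset was performed.
 */
bool
model_watchdog(struct model_softc *sc)
{
	assert(sc != NULL);
	if (!sc->rx_error_pending)
		return (false);
	sc->rx_error_pending = false;
	model_init_locked(sc);
	return (true);
}
```

Several errors between two ticks lead to a single reset in this model.
Stricter checks on the error condition would go where the flag is set
in model_note_rx_error().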
cheers
luigi
Regards,
Pyun YongHyeon