Re: suspect bug in vge(4)



On Tue, Jun 09, 2009 at 02:12:09AM +0200, Thomas Lotterer wrote:
I need advice hunting down a network problem which I suspect to be
a bug in the vge(4) driver. After spending a lot of time on
investigation, I'm out of ideas.

My recently built new home server running FreeBSD 8.0-CURRENT as of
2009-06-07 on a VIA ARTiGO A2000 [1] exhibits network problems when
sending more than a couple of dozen kilobytes of TCP traffic.

The server application is the "Dovecot" [2] secure IMAP server.
The client application is "Thunderbird" [3] running on Windows XP.

The high-level view of the problem is that the client seems to stall
downloading messages or even a complex structure of IMAP folder names.
When using STARTTLS the client often prints the infamous generic and
misleading error "Thunderbird received a message with incorrect Message
Authentication Code. If the error occurs frequently, contact the website
administrator". The origin of this message is the SSL library that ships
with Thunderbird. The same library is used for Firefox where the hint
might actually make sense when the user is attempting to access a broken
HTTPS server. After lots of debugging I found out that the same error is
printed not only for TLS/SSL issues but also for broken TCP streams, be
it wrong TCP checksums or a server process dumping core. So I tried IMAP
without TLS, only to see the same issue with the misleading SSL error
replaced by an application hang. I ran truss(1) against Dovecot, placed
Thunderbird in debug mode [4] and found out that during a stall the
server did write(2) all the data to the TCP socket but some data did not
arrive at the client.

The low-level view of the problem is that Wireshark on the client side
sooner or later - not within the first few dozen packets - sees a packet
with an incorrect TCP checksum. Usually the next packet is from the
server again, continuing the stream. What follows is the expected but
fruitless recovery attempt: the client sends duplicate ACKs for the last
good packet, but the server retransmits more TCP packets with bad
checksums.
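
For readers who want to double-check a "bad checksum" verdict
independently of Wireshark, the following is a minimal sketch of the
standard Internet checksum (RFC 1071) applied to a TCP segment over
IPv4. The function names are illustrative and not part of the original
report, and the raw segment bytes are assumed to have been exported
from a capture taken on the receiving side, since a capture on a sender
that offloads TX checksums can show bad checksums that never reach the
wire.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Verify a TCP segment (header plus payload) carried over IPv4.

    src_ip and dst_ip are the 4-byte addresses from the IP header.
    Summing the pseudo-header and the segment, including its stored
    checksum field, yields zero when the segment is intact.
    """
    pseudo_header = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return internet_checksum(pseudo_header + segment) == 0
```

If this agrees with Wireshark for the flagged packets, the corruption
is real on the wire rather than an artifact of the capture tool.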

To me it sounds like a broken implementation of hardware-generated
checksums. Disabling all of the offload features with the "-tso", "-lro",
"-txcsum" and "-rxcsum" options and enabling the "polling" option on the
server-side network interface did not help. So either something deeper
is broken or maybe just the ability to disable these features needs
fixing. Btw, the client uses the "VMware Accelerated AMD PCNet Adapter"
driver with "TCP/IP Offload=off" and "TsoEnable=0".
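
One way to tell "something deeper is broken" apart from "the flags were
never cleared" is to look at the options line that ifconfig(8) prints
for the interface after the change. The following is a rough sketch
that extracts the offload capabilities still listed there. The sample
line is illustrative only, not output from the machine in question, and
the helper name and the set of flag names checked are assumptions.

```python
import re

# Capability names as ifconfig(8) prints them; which ones a given
# driver reports is an assumption to verify on the actual interface.
OFFLOAD_FLAGS = {"RXCSUM", "TXCSUM", "TSO4", "LRO"}


def enabled_offloads(ifconfig_output: str) -> set:
    """Return the offload capabilities still listed in the options line."""
    match = re.search(r"options=[0-9a-fA-F]+<([^>]*)>", ifconfig_output)
    if match is None:
        return set()  # no options line, so nothing is reported as enabled
    return set(match.group(1).split(",")) & OFFLOAD_FLAGS


# Illustrative sample only, not captured from the ARTiGO.
SAMPLE = "\toptions=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>\n"
print(sorted(enabled_offloads(SAMPLE)))
```

An empty result after running ifconfig with the "-txcsum" and
"-rxcsum" options would suggest the flags were cleared and the fault
lies elsewhere; a non-empty result would point at the disable path
itself.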

Sorry to bother you with more details but here's why I believe it's a
hardware/driver issue. Before I purchased the hardware I tried a dry
run. Installed FreeBSD 7.1-RELEASE as VM guest, then upgraded to FreeBSD
8.0-CURRENT using FreeBSD Administration Toolkit [5]. Built OS and apps
from source, loaded my data - worked! Used the same client that has
problems with the real hardware today. Then used that VM as build host
to create the NanoBSD [6] Flash image for the ARTiGO. Both use exactly
the same sources. The VM works, the metal is broken. One of the few
differences is the NIC and its driver. As a workaround I copied the VM
to an ordinary PC equipped with an fxp(4) NIC - worked! So it really looks
like an OS/HW compatibility issue on the ARTiGO.

In case you are considering a hardware defect please note that before I
loaded the OS, apps and my data to this new hardware I thoroughly tested
what I could. One week filling the disks to the max with repeated
copies of a file created from /dev/random and, after manually breaking
and rebuilding the ZFS mirror, checking data integrity using message
digests. No problems with disks, albeit poor SATA performance, but
that's another story. One day running memtest86 [7]. No problems with
memory. One hour NIC test copying /dev/zero to /dev/null over the wire
using "scp -o compression=no". No hangs or hiccups here.
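
The digest check described above can be sketched roughly as follows.
This is not the procedure actually used; SHA-256, the function names
and the file layout (one reference file plus its copies) are
assumptions made for illustration.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(reference: str, copies: list) -> list:
    """Return the copies whose digest differs from the reference file."""
    expected = file_digest(reference)
    return [path for path in copies if file_digest(path) != expected]
```

The same comparison could be applied to a random file sent over the
NIC, which exercises data integrity on the wire in a way that copying
/dev/zero does not.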

Hope you can help me.


I already know there are possible edge cases in vge(4), but your
issue looks quite different from anything reported so far.
Unfortunately the vge(4) hardware I had was broken, so I couldn't
finish overhauling the driver. The code at the following URLs is the
latest WIP version, but I don't know whether it fixes the issue, as it
was never tested on real hardware.
http://people.freebsd.org/~yongari/vge/if_vge.c
http://people.freebsd.org/~yongari/vge/if_vgereg.h
http://people.freebsd.org/~yongari/vge/if_vgevar.h