Re: nfs-server silent data corruption




re,


Jeremy Chadwick <koitsu@xxxxxxxxxxx> writes:

On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
Kris Kennaway <kris@xxxxxxxxxxx> writes:
Uh, you're getting server-side data corruption, it could definitely be
because of the memory you added.

yop, though I'm still not convinced the memory is bad (the very same
Kingston ECC as the 2*1G in use for about half a year already) :

Can you download and run memtest86 on this system, with the added 2G ECC
installed? memtest86 doesn't guarantee showing signs of memory problems,
but in most cases it'll start spewing errors almost immediately.


it finished in a bit less than 3 hours without a single error/warning

I feel pretty confident all memory is fine

One thing I did notice in the motherboard manual below is something
called "Hammer Configuration". It appears to default to 800MHz, but
there's an "Auto" choice. Does using Auto fix anything?

Nope

I added it directly to the 2nd CPU (diagram on page 9 of
http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
seems to be the interaction between nfe0 and powerd .... :

That board is the weirdest thing I've seen in years.


;) I agree, I raised my eyebrows the first time I saw that
diagram


Two separate CPUs using a single (shared) memory controller, two
separate (and different!) nVidia chipsets, a SMSC I/O controller
probably used for serial and parallel I/O, two separate nVidia NICs with
Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
separate PCI-e busses (each associated with a separate nVidia chipset),
two separate PCI-X busses... the list continues.

some may say "it's just four wheels, an engine and a steering wheel",
but she looks different compared to most others


I know you don't need opinions at this point, but what a behemoth. I
can't imagine that thing running reliably.

though it does ;) (till the day I decided she deserved a -stable upgrade
and 2 more gigs ...)

- if I stop powerd, problems go away

This would imply that clock frequency stepping is somehow contributing
to the corruption. I don't see any BIOS options for controlling
things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
usually what handles this.

you can turn it on/off; anyway, the problem *seems* easy to reproduce
when the frequency drops quickly from 2600MHz to 1000MHz ....
I have only inspected a few corrupted copies so far, but out of
10-200 Mbytes just 1 byte was 0 instead of \t
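
For anyone who wants to locate such single-byte differences in their own copies, here is a minimal sketch. It is not the tool used above; the function name and the paths passed to it are hypothetical, and it assumes both files are regular files that can be read locally.

```python
from pathlib import Path


def differing_bytes(original, copy, chunk_size=1 << 20):
    """Yield (offset, original_byte, copied_byte) for every byte that differs.

    Only the common length of the two files is compared; a size
    difference is not reported by this sketch.
    """
    with open(Path(original), "rb") as a, open(Path(copy), "rb") as b:
        offset = 0
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a or not chunk_b:
                break
            if chunk_a != chunk_b:
                # Walk the chunk only when it is known to differ.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        yield offset + i, x, y
            offset += min(len(chunk_a), len(chunk_b))
```

A corrupted copy like the ones described here would show up as a single tuple whose second value is 9 (\t) and whose third value is 0.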

- if I let powerd run but turn off txcsum and tso4 on the interface,
the problem is a lot harder to reproduce (in case this gives
a hint to anyone)
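
To put numbers on "a lot harder to reproduce" for each setting (powerd on/off, txcsum/tso4 on/off), a loop along these lines could be used. This is only a sketch under assumptions: the function names are made up for illustration, dest_dir is assumed to be a directory on the NFS mount, and a read-back on the client may be served from its cache, so hashing the copies on the server side may be more telling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(source, dest_dir, rounds=20):
    """Copy source into dest_dir `rounds` times and return the copies
    whose checksum differs from the checksum of the source."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    expected = sha256_of(source)
    corrupt = []
    for n in range(rounds):
        target = dest_dir / f"{source.name}.copy{n}"
        shutil.copyfile(source, target)
        if sha256_of(target) != expected:
            corrupt.append(target)
    return corrupt
```

Running it once per configuration and comparing the length of the returned list gives a rough failure rate for each case.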

Possibly shared interrupts are causing problems?


don't think so; I first had two Promise TX4 cards in this box instead
of the Marvell 8-port card; since I had problems with the TX4 some time
ago I first suspected them. The board is still running memtest86, but
from the dmesg I posted I don't see a shared irq.

MSI/MSI-X doing something odd? Have you tried disabling MSI/MSI-X to
see if it makes a difference?


MSI is disabled as is PCI-e Error reporting (or something like
that)


I think you mean "MAC LAN Bridge", according to the motherboard manual.
I'm not even sure what that really does; somehow trunks the two NICs
together to give you the equivalent of 2000Mbit of traffic? I don't
know.

probably; I never tried ;) I need the second NIC for a separate
subnet

Does the corruption you see go away if you install a separate NIC (e.g.
an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
(should be "MAC LAN: Disable" on both the primary and slave)?

Don't have one available right now (for a 2U server).
I will test if I do not find another solution.

Thanx, Arno
_______________________________________________
freebsd-net@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@xxxxxxxxxxx"


