Tru64 server can't handle 900 network clients
From: Ole Holm Nielsen (ohnielse_at_fysik.dtu.dk)
Date: 09/17/04
Date: Fri, 17 Sep 2004 20:49:38 +0200 To: tru64-unix-managers@ornl.gov
I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6)
on the number of client-node MAC addresses it can handle when serving
close to 1000 NFS clients.
We expanded our Linux cluster to 900+ nodes, and suddenly the
Tru64 UNIX NFS file server randomly loses network communication
with many (or most) of the new nodes. A "ping" fails from
either end of the server-client connection. Communication between
Linux servers and nodes works perfectly, however, so we do not
believe there to be a problem with the network setup.
What I believe happens is "ARP cache thrashing": the Tru64 kernel
apparently can't cope with close to 1000 MAC addresses simultaneously
because a fixed-size ARP cache fills up, and the kernel starts
deleting MAC addresses from the ARP cache randomly. See "man 7 arp"
on a Linux box about the cache. On the Linux boxes we solve the
ARP cache problem by loading a static cache from the /etc/ethers file,
but on Tru64 UNIX this causes a dead-sure communications failure :-(
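
To illustrate the hypothesis above, here is a toy simulation of a
fixed-size cache with random eviction. It is only a sketch and not a
model of the real Tru64 kernel: the cache size of 512 entries, the
random-eviction policy, the uniform traffic pattern, and the function
name simulate_arp_cache are all assumptions made for illustration.

```python
import random


def simulate_arp_cache(num_clients, cache_size, num_packets, seed=0):
    """Return the fraction of lookups that miss a fixed-size cache.

    Hypothetical model: each packet comes from a uniformly random client,
    and when the cache is full a randomly chosen entry is evicted.
    """
    rng = random.Random(seed)
    slots = []        # cached client ids, used to pick a random victim
    cached = set()    # the same ids, for fast membership tests
    misses = 0
    for _ in range(num_packets):
        client = rng.randrange(num_clients)
        if client in cached:
            continue  # cache hit: no ARP resolution needed
        misses += 1
        if len(slots) < cache_size:
            slots.append(client)
        else:
            # Cache is full: overwrite a random slot (random eviction).
            victim_index = rng.randrange(len(slots))
            cached.discard(slots[victim_index])
            slots[victim_index] = client
        cached.add(client)
    return misses / num_packets


if __name__ == "__main__":
    # 512 is an assumed cache size, not a documented Tru64 value.
    for size in (512, 1024):
        rate = simulate_arp_cache(900, size, 20000)
        print(f"cache_size={size}: miss rate {rate:.3f}")
```

In this model, a cache smaller than the number of active clients keeps
missing indefinitely, while a cache large enough to hold every client
misses only on the first packet from each one.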
Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I
haven't been able to find any kernel parameter that would increase
the maximum size of the ARP cache. Can anyone help?
Note: The 900 nodes are divided about equally between two Gigabit
interfaces on the Tru64 UNIX server.
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark