SUMMARY: Tru64 server can't handle 900 network clients
- From: Ole Holm Nielsen <Ole.H.Nielsen@xxxxxxxxxxxx>
- Date: Thu, 22 Dec 2005 09:36:42 +0100
This is an old question, but anyone with more than 512 machines on the local network needs to know how to increase the Ethernet ARP cache size in Tru64 UNIX. I received a solution to the problem from an HP Denmark consultant:
You need to look at and possibly increase the Tru64 kernel's internal variable "arpqmaxlen", which unfortunately cannot be set through the usual /etc/sysconfigtab method. This variable is the number of Ethernet MAC addresses kept in the cache, and should be somewhat larger than 2 times the number of nodes on your network. The kernel variables related to the ARP cache are defined in /usr/sys/include/netinet/inet_config.h.
To display the "arpqmaxlen" value, use /usr/bin/dbx on the kernel:

  # dbx -k /vmunix
  (dbx) p arpqmaxlen
  1024

To assign a new value until the next reboot:

  (dbx) assign arpqmaxlen = 2048

To assign a new value permanently in /vmunix:

  (dbx) patch arpqmaxlen = 2048

Then exit dbx with the "quit" command. If a new kernel gets installed, for example by installing a new Patch Kit, you will need to modify /vmunix again as described.
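The sizing rule quoted above ("somewhat larger than 2 times the number of nodes") can be turned into a quick calculation. The following is only a minimal sketch: the function name suggest_arpqmaxlen is hypothetical, and rounding up to the next power of two is my own assumption, chosen because it happens to reproduce the 2048 used for a network of about 950 nodes. Nothing in the advice from HP says the value must be a power of two.

```python
def suggest_arpqmaxlen(nodes: int) -> int:
    """Return the smallest power of two strictly greater than 2 * nodes.

    Assumption: "somewhat larger than 2 times the number of nodes" is
    interpreted as "next power of two above 2 * nodes". This is an
    illustrative convention, not a documented Tru64 requirement.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    size = 1
    while size <= 2 * nodes:
        size *= 2
    return size


if __name__ == "__main__":
    # Print a suggested value for a few example network sizes.
    for n in (500, 950, 2000):
        print(n, suggest_arpqmaxlen(n))
```

The result would then be used as the value in the dbx "assign" or "patch" command shown above.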
We've been running a local network with about 950 nodes without ARP cache problems for over a year now, so this solution seems to be well tested.
Additional note in case anyone is interested: On Linux hosts the same modification can be made at boot time via the /etc/sysctl.conf file (Red Hat RHEL4 with kernel 2.6.9):
  # Don't allow the arp table to become bigger than this
  net.ipv4.neigh.default.gc_thresh3 = 4096
  # Tell the gc when to become aggressive with arp table cleaning.
  # Adjust this based on size of the LAN.
  net.ipv4.neigh.default.gc_thresh2 = 2048
  # Adjust where the gc will leave arp table alone
  net.ipv4.neigh.default.gc_thresh1 = 1024
  # Adjust to arp table gc to clean-up more often
  net.ipv4.neigh.default.gc_interval = 3600
  # ARP cache entry timeout
  net.ipv4.neigh.default.gc_stale_time = 3600
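To check whether a Linux host is getting close to the hard limit set above, one can compare the number of entries in /proc/net/arp with the current gc_thresh3 value. The script below is a rough sketch under some assumptions: the helper names are hypothetical, /proc/net/arp is assumed to have one header line followed by one line per entry, and only IPv4 ARP entries are counted.

```python
from pathlib import Path

# Standard Linux procfs locations for the IPv4 ARP table and the hard limit.
ARP_TABLE = Path("/proc/net/arp")
GC_THRESH3 = Path("/proc/sys/net/ipv4/neigh/default/gc_thresh3")


def count_arp_entries(table_text: str) -> int:
    """Count entries in /proc/net/arp style text (first line is a header)."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def headroom(entries: int, hard_limit: int) -> int:
    """Return how many more entries fit before the hard limit is reached."""
    return hard_limit - entries


if __name__ == "__main__":
    entries = count_arp_entries(ARP_TABLE.read_text())
    limit = int(GC_THRESH3.read_text())
    print(f"ARP entries: {entries}, gc_thresh3: {limit}, "
          f"headroom: {headroom(entries, limit)}")
```

Running it from cron on a busy server would show whether the chosen thresholds leave enough room for the size of the LAN.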
Ole Holm Nielsen wrote:
I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6) on the number of client MAC addresses it can handle, with close to 1000 NFS clients. We expanded our Linux cluster to 900+ nodes, and suddenly the Tru64 UNIX NFS file server randomly loses network communication with many (or most) of the new nodes. A "ping" doesn't work at either end of the server-client connection. Communication between Linux servers and nodes works perfectly, however, so we do not believe there is a problem with the network setup.
I believe what happens is "ARP cache thrashing": The Tru64 kernel apparently can't cope with close to 1000 MAC addresses simultaneously because a fixed-size ARP cache fills up, and the kernel starts deleting MAC addresses from the ARP cache at random. See "man 7 arp" on a Linux box about the cache. On the Linux boxes we solve the ARP cache problem by loading a static cache from the /etc/ethers file, but on Tru64 UNIX this causes a dead-sure communications failure :-(
Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I haven't been able to find any kernel parameter that would increase the maximum size of the ARP cache. Can anyone help? Note: The 900 nodes are divided about equally between two Gigabit interfaces on the Tru64 UNIX server.
-- Ole Holm Nielsen Department of Physics, Technical University of Denmark, Building 307, DK-2800 Kongens Lyngby, Denmark