SUMMARY: Tru64 server can't handle 900 network clients



This is an old question, but anyone with more than 512 machines on the
local network needs to know how to increase the Ethernet ARP cache size
in Tru64 UNIX.  I received a resolution to the problem from an HP Denmark
consultant:

You need to look at and possibly increase the Tru64 kernel's internal
variable "arpqmaxlen", which unfortunately cannot be set through the
usual /etc/sysconfigtab method.  This variable is the number of
Ethernet MAC addresses kept in the cache, and should be somewhat
larger than 2 times the number of nodes on your network.  The kernel
variables related to the ARP cache are defined in
/usr/sys/include/netinet/inet_config.h.
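
As a rough illustration of the sizing rule above, here is a minimal Python
sketch that picks a candidate "arpqmaxlen" for a given node count.  The
function name is hypothetical, and rounding up to a power of two is only
an assumption made here to match the 1024/2048 values used in this post;
nothing in the advice I received says the kernel requires a power of two.

```python
def suggested_arpqmaxlen(nodes: int, factor: int = 2) -> int:
    """Return the smallest power of two strictly greater than factor * nodes.

    'factor' defaults to 2, following the "somewhat larger than 2 times
    the number of nodes" rule of thumb.  The power-of-two rounding is an
    assumption of this sketch, not a documented kernel requirement.
    """
    if nodes < 1:
        raise ValueError("nodes must be a positive integer")
    target = factor * nodes
    size = 1
    # Double until we are strictly above the target.
    while size <= target:
        size *= 2
    return size


if __name__ == "__main__":
    for n in (500, 950):
        print(n, "nodes ->", suggested_arpqmaxlen(n))
```

For a network of about 950 nodes this gives 2048, which is the value
assigned in the dbx examples in this post.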

To display the "arpqmaxlen" value use /usr/bin/dbx on the kernel:
   # dbx -k /vmunix
   (dbx) p arpqmaxlen
   1024
To assign a new value until next reboot:
   (dbx) assign arpqmaxlen = 2048
To assign a new value permanently in /vmunix:
   (dbx) patch arpqmaxlen = 2048
Then exit dbx with the "quit" command.  If a new kernel gets installed,
for example by installing a new Patch Kit, you will need to patch the
new /vmunix again as described.

We've been running a local network with about 950 nodes without ARP
cache problems for over a year now, so this solution seems to be well
tested.

Additional note in case anyone is interested:
On Linux hosts the equivalent tuning can be applied at boot time via the
/etc/sysctl.conf file (Red Hat RHEL4 with kernel 2.6.9):

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Below this number of entries the gc leaves the arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# How often (in seconds) the arp table gc runs
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
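
To check whether a Linux host is getting close to these limits, one can
compare the number of entries in /proc/net/arp against the three
gc_thresh values.  The following Python sketch does that; the function
names and the wording of the status strings are my own, and it assumes
the usual /proc/net/arp layout of one header line followed by one line
per entry.

```python
from pathlib import Path

# Standard procfs locations on Linux; adjust if your system differs.
ARP_TABLE = Path("/proc/net/arp")
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(proc_net_arp: str) -> int:
    """Count entries in text laid out like /proc/net/arp.

    The first non-empty line is assumed to be the column header.
    """
    lines = [ln for ln in proc_net_arp.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def cache_pressure(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Describe where the entry count sits relative to the gc thresholds."""
    if entries >= thresh3:
        return "at or above gc_thresh3 (hard limit)"
    if entries >= thresh2:
        return "at or above gc_thresh2 (soft limit)"
    if entries >= thresh1:
        return "at or above gc_thresh1 (gc may run)"
    return "below gc_thresh1"


if __name__ == "__main__":
    if ARP_TABLE.exists() and NEIGH_DIR.exists():
        entries = count_arp_entries(ARP_TABLE.read_text())
        t1, t2, t3 = (
            int((NEIGH_DIR / name).read_text())
            for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3")
        )
        print(entries, "entries:", cache_pressure(entries, t1, t2, t3))
    else:
        print("procfs ARP files not found; is this a Linux host?")
```

With the values listed above, a LAN of about 950 nodes stays below
gc_thresh1, so the garbage collector should not start evicting entries.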


Ole Holm Nielsen wrote:
I'm stumped by an apparent limit in the Tru64 UNIX kernel (v5.1A pk6) on
the number of client MAC addresses it can handle, with close to 1000 NFS
clients.  We expanded our Linux cluster to 900+ nodes, and suddenly the
Tru64 UNIX NFS file server randomly loses network communication with
many (or most) of the new nodes.  A "ping" doesn't work from either end
of the server-client connection.  Communication between Linux servers and
nodes works perfectly, however, so we do not believe there is a
problem with the network setup.

What happens is, I believe, "ARP cache thrashing":  The Tru64 kernel
apparently can't cope with close to 1000 MAC addresses simultaneously
because a fixed-size ARP cache fills up, and the kernel starts deleting
MAC addresses from the ARP cache randomly.  See "man 7 arp"
on a Linux box about the cache.  On the Linux boxes we solve the ARP
cache problem by loading a static cache from the /etc/ethers file, but
on Tru64 UNIX this causes a dead-sure communications failure :-(

Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I haven't
been able to find any kernel parameter which may increase the maximum
size of the ARP cache.  Can anyone help?
Note: The 900 nodes are divided about equally between two Gigabit
interfaces on the Tru64 UNIX server.

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark


