Outbound TCP issue, potentially related to 'FreeBSD-SA-05:08.kmem [REVISED]'

From: Matt Ruzicka (matt_at_frii.com)
Date: 05/13/05

  • Next message: Mike Silbersack: "Re: Outbound TCP issue, potentially related to'FreeBSD-SA-05:08.kmem [REVISED]'"
    Date: Thu, 12 May 2005 17:13:48 -0600 (MDT)
    To: freebsd-net@freebsd.org
    
    

    A couple of days after we patched our systems, we started to receive a
    number of reports of MySQL connection errors when our patched FreeBSD 4.9
    web servers were trying to connect to our MySQL server, which lives on a
    separate FreeBSD machine.

    Initially we thought this was a networking error related to our server
    load balancer (which has been a troublemaker in the past) or some other
    networking device, but testing has proven otherwise.

    * Problem description:

    Outbound TCP connections are randomly failing to connect. They receive a
    "Can't assign requested address" error from the connect() call. The error
    has been demonstrated against multiple machines on multiple different
    ports. It only impacts outgoing connections from our web servers - no
    inbound connections have failed or dropped. Also, we have not seen this
    problem on any of our other servers, which have also been patched.

    The errors are sporadic. The most frequent pattern we've seen is a 5-
    to 10-minute period of success, followed by a couple of seconds of
    frequent failures. When we start getting errors connecting to one
    port/machine, we see concurrent errors to other ports/machines.

    * What we've tried:

    The impacted machines are in a server-load-balanced environment, so we
    spent quite a bit of time convincing ourselves that this was not an
    external network error. We created a Perl test script that tries to
    connect to a given machine and port once per second and logs its
    success or failure (the script is included below). We then aimed it at
    machines both inside and outside the SLB environment.

    We originally tried it against multiple different ports, but after
    finding that the failures were not port-specific, we simplified the
    methodology to make all connections to port 5666 (a monitoring app).

    Reverse tests were also run to see if the failures impacted incoming
    connections. No failures were ever logged in this direction.

    The tests established that we reliably saw failures from the two
    impacted machines to any other server, including each other. (The two
    boxes are separated by a switch, but not the SLB.) It did not matter
    whether the remote machine was on the same network, or was in front of
    or behind the SLB switch. Connections between other machines behind the
    same switch showed no failures.

    We next set up tcpdump on one impacted machine and started logging the
    test connections. When a failure occurred, the dumps showed no packets
    leaving the box to the target machine.

    At that point we felt reasonably confident that the problem was not an
    external network issue, so we moved on to systems troubleshooting.

    Since these machines were running a few revisions behind, we felt it would
    be prudent to upgrade. Both web servers have since been upgraded to the
    latest version of 4.11 to rule out an issue specific to the old version
    we were running. After the upgrade, errors returned to their previous
    levels following a lull of a few hours.

    Apache, PHP and related modules were all reinstalled on the boxes after
    the FreeBSD upgrade to ensure they were using the correct libraries and
    such.

    The only error we have found in the logs appeared right after boot. It is
    related to PMAP_SHPGPERPROC and is discussed here:

      http://lists.freebsd.org/pipermail/freebsd-hackers/2003-May/000695.html

    If I understand this correctly, we should have plenty of PV entries
    available.

    -----
    Message Queues:
    T ID KEY MODE OWNER GROUP CREATOR CGROUP CBYTES QNUM QBYTES LSPID LRPID STIME RTIME CTIME

    Shared Memory:
    T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
    m 262144 0 --rw------- root wheel root wheel 21 524288 81250 81250 14:03:40 17:02:37 14:03:40
    m 458754 0 --rw------- root wheel root wheel 42 524288 74667 74667 16:06:03 17:02:39 16:06:03

    Semaphores:
    T ID KEY MODE OWNER GROUP CREATOR CGROUP NSEMS OTIME CTIME

    ITEM SIZE LIMIT USED FREE REQUESTS
    PV ENTRY: 28, 2281326, 545883, 1036172, 589082427
    -----

    * Test script:

    Note that we also tried a similar script using raw socket calls, rather
    than using IO::Socket. The results were identical.

    -----
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Sys::Hostname qw(hostname);
    use IO::Socket;

    use constant LOG_FILE => '/tmp/';

    # host to connect to
    my $host = shift(@ARGV) || 'xxx.xxx.xxx.xxx';

    # open our log file
    my $log_file = LOG_FILE . hostname() . '_to_' . $host . '.nrpe';
    open(LOG, '>>', $log_file) or die "Can't open log: $log_file $!";

    while(1){

            my $start_time = time();

            # try a connection
            eval {
                    my $socket = IO::Socket::INET->new($host . ':5666')
                            or die "Can't connect: $!";

                    $socket->close();
            };

            my $result = "ok";
            $result = "failed ($@)" if $@;

            print LOG hostname() . ' ' . scalar(localtime($start_time)) . ' ' . $result . "\n";

            sleep 1;
    }
    -----
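
    For illustration, a raw-socket variant of the connection attempt might
    look roughly like the sketch below. It is not the script that was
    actually run; the sub name try_connect and its host/port arguments are
    assumptions made for the example. It uses only Perl's built-in socket()
    and connect() calls plus helpers from the core Socket module.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Socket qw(PF_INET SOCK_STREAM inet_aton sockaddr_in);

# Hypothetical helper: attempt one TCP connection using the built-in
# socket calls instead of IO::Socket.
# Returns "ok" on success or "failed (<reason>)" on any error.
sub try_connect {
        my ($host, $port) = @_;

        # resolve the host name or dotted quad to a packed address
        my $addr = inet_aton($host)
                or return "failed (can't resolve $host)";

        # scalar assignment yields the protocol number
        my $proto = getprotobyname('tcp');

        socket(my $sock, PF_INET, SOCK_STREAM, $proto || 0)
                or return "failed (socket: $!)";

        # connect() is the call that reports
        # "Can't assign requested address" when the problem occurs;
        # $! is captured into the string before the socket is closed
        my $result = connect($sock, sockaddr_in($port, $addr))
                ? "ok"
                : "failed (connect: $!)";

        close($sock);
        return $result;
}
```

    Calling try_connect($host, 5666) in place of the eval block in the
    script above would produce the same kind of ok/failed log lines.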

    * Summary:

    Since this is not affecting any of our other servers, which have also been
    patched, I do not feel it is a direct result of the patch, but I suspect
    the patch may have accentuated an existing issue.

    Any suggestions as to what could be causing this would be greatly
    appreciated.

    Please let me know what additional information about the system I can
    gather if it will be of assistance.

    Thank you very much in advance.

    Matthew Ruzicka - Systems Administrator
    Front Range Internet, Inc.
    matt@frii.net - (970) 212-0728


