Strange CAA behaviour on two clusters

From: Tim Cutts (tjrc_at_sanger.ac.uk)
Date: 03/16/05

  • Next message: jpfitz_at_fnal.gov: "Good-bye"
    Date: Wed, 16 Mar 2005 10:04:45 +0000
    To: tru64-unix-managers@ornl.gov
    
    

    We have a number of clusters running CAA MySQL instances. For the past
    two days, all the instances on two clusters (one 6 node, the other 2
    node) have briefly failed their check scripts simultaneously, with a
    message like the one below.

    Begin forwarded message:

    > From: system PRIVILEGED account <root@sanger.ac.uk>
    > Date: 15 March 2005 6:03:15 am GMT
    > Subject: EVM ALERT [600]: CAAD[3146717]: RTD #0: Action Script
    > /var/cluster/caa/script/mysql_3366.scr(check) timed out! (timeout=60)
    >
    > ============================ Syslog event ============================
    > EVM event name: sys.unix.syslog.daemon
    >
    > Syslog daemon events are posted by system daemons to alert the
    > administrator to an unusual condition. The user name field usually
    > indicates which daemon posted the event. The text of the message
    > indicates the reason for the event.
    >
    > ======================================================================
    >
    > Formatted Message:
    > CAAD[3146717]: RTD #0: Action Script
    > /var/cluster/caa/script/mysql_3366.scr(check) timed out!
    > (timeout=60)
    >
    > Event Data Items:
    > Event Name : sys.unix.syslog.daemon
    > Priority : 600
    > PID : 3146415
    > PPID : 3145729
    > Event Id : 561375
    > Member Id : 6
    > Timestamp : 15-Mar-2005 06:03:09
    > Host Name : ecs2f
    > Cluster Name : ecs2
    > User Name : root
    > Format : CAAD[3146717]: RTD #0: Action Script
    > /var/cluster/caa/script/mysql_3366.scr(check)
    > timed
    > out! (timeout=60)
    > Reference : cat:evmexp.cat:200
    >
    > Variable Items:
    > None
    >
    > ======================================================================

    We get one of these messages for each of the six instances running on
    that particular cluster, all with the same timestamp.

    Now, all the check script actually does is as follows, sending us a
    detailed message if the mysql ping fails:

    --- BEGIN CHECK SCRIPT ---
    #!/bin/sh
    #
    # Usage: check-mysql.sh instance
    #
    #
    if [ $# -lt 1 ]
    then
    echo "You must supply the instance name."
    echo "example $0 mysql-3350"
    exit 1
    fi

    output=`/usr/local/ensembl/mysql/bin/mysqladmin
    --defaults-file=/var/cluster/caa/script/mysql/$1.cnf -h ecs2 ping 2>&1`

    code=$?

    case "$_CAA_REASON" in
         user) ;; # It's a manual user request, don't report it
         *)
             if [ $code -ne 0 ]; then
             /usr/lib/sendmail -t <<EOF
    To: ssg-isg@sanger.ac.uk
    From: ssg-isg@sanger.ac.uk
    Subject: MySQL CAA instance ecs2:$1 failed check

    CAA_REASON: $_CAA_REASON

    $output
    EOF
    fi
             ;;
    esac

    exit $code
    --- END CHECK SCRIPT ---

    Now, we never get that message, which suggests that CAA is killing the
    check script before it gets to the sent message; which means it's
    likely that the mysqladmin ping is blocking.

    But why should that be the case? The MySQL instance has been running
    all the time according to mysqladmin status:

    11:58:47 tjrc@ecs2d:~$ /usr/local/ensembl/mysql/bin/mysqladmin -P 3361
    -h ecs2 status
    Uptime: 421321 Threads: 1 Questions: 26385 Slow queries: 6 Opens:
    218 Flush tables: 1 Open tables: 163 Queries per second avg: 0.063

    I can imagine one of the instances being too busy to respond,
    occasionally, but not all of them.

    I then suspected the cluster alias configuration might be wrong, but it
    looks fine to me in /etc/clua_services:

    mysql_3361 3361/tcp in_single,static
    mysql_3362 3362/tcp in_single,static
    mysql_3363 3363/tcp in_single,static
    mysql_3364 3364/tcp in_single,static
    mysql_3365 3365/tcp in_single,static
    mysql_3366 3366/tcp in_single,static

    and is identical to the configuration used on a third cluster, which is
    not seeing the same issue.

    We then suspected that it might be the network that's at fault, since
    both affected clusters are connected to the same switches. But that
    shouldn't be the case because the routes for the cluster alias don't
    use the external network:

    12:05:42 tjrc@ecs2d:~$ netstat -r | grep ecs2\\.
    ecs2.internal.sanger.ac.uk localhost UH 6 1003685847 lo0

    Unless, of course, niffd changed the routing tables, but I can't find
    any evidence of that directly, although there are a *lot* of NIFF
    events on ecs2.

    So, I switched off niffd, to see whether the problem went away.

    [ next day ]

    The same thing happened again today (at a different time - 9am rather
    than 6am). So it isn't niffd.

    The 6am and 9am times are suspicious; I'm wondering whether there's
    some cron task which happens at those times which is causing problems,
    but I can't see any such tasks - and why would they cause the cluster
    alias to fail to listen to the loopback interface anyway?

    I'm stumped. Anyone got any bright ideas?

    Tim

    -- 
    Dr Tim Cutts
    Informatics Systems Group, Wellcome Trust Sanger Institute
    GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233
    

  • Next message: jpfitz_at_fnal.gov: "Good-bye"