[SUMMARY] ASE hiccup leads to domain panic leads to strange ASE/AdvFS state

From: Speakman, John H./Epidemiology-Biostatistics (speakmaj_at_MSKCC.ORG)
Date: 09/29/03

  • Next message: Speakman, John H./Epidemiology-Biostatistics: "[SUMMARY]: login never gets to prompt, waiting for NFS3 service"
    Date: Mon, 29 Sep 2003 17:38:10 -0400
    To: tru64-unix-managers@ornl.gov
    
    

    Thanks to Johan and Yogesh for insightful comments. We tried first to
    unmount the errant AdvFS domains (the ones that appeared not to be under
    ASE's control); they came back with a "device busy" even when we
    couldn't find any app/user that might be using them. Then I tried to
    relocate the services from the server that looked like it had the
    "right" ones to the one that had the "wrong" ones, and asemgr hung.
    Ctrl-C got no response, so I closed that terminal window. After that I
    couldn't log in at all, as either host would hang after /etc/motd with
    an "nfs server xxx not responding still trying" message (see the
    separate summary).
    If you're set up in a more straightforward way you should be able to log
    in, but the fact is that the service you were trying to move will
    probably not be accessible. At this point, as I couldn't log in, I
    decided to bring a server down hard with the reset button. I chose
    the (to us) less important server, the one that was hosting the "right"
    services. As soon as I did so I got a result: the "right" services
    "overwrote" the "wrong" ones on the other server. Scary, and I wouldn't
    like to do it again (mainly because I don't know why it worked).
    However, I made two changes which I credit with keeping things working
    fine ever since. First I turned off defragcron; then I moved a few
    files off to make the volume less full.

    Not so much a summary as a diary, but maybe it will help someone.

    John

    -----Original Message-----
    From: Speakman, John H./Epidemiology-Biostatistics
    Sent: Friday, September 19, 2003 3:38 PM
    To: tru64-unix-managers@ornl.gov
    Cc: Speakman, John H./Epidemiology-Biostatistics
    Subject: ASE hiccup leads to domain panic leads to strange ASE/AdvFS
    state

    Hi all

    We have a little cluster of two very old Alphas running 4.0E - they are
    clustered using ASE over a private network (i.e. a crossover cat 5
    cable). We haven't changed the configuration in years, and two nights
    ago it had a hiccup. What we see in syslog is:

    Sep 18 03:36:28 biosta vmunix: arp: local IP address 192.168.32.228 in
    use by hardware address 00-00-...
    Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.227 in
    use by hardware address 00-00-...
    Sep 18 03:36:29 biosta vmunix: arp: local IP address 192.168.32.226 in
    use by hardware address 00-00-...

    These IP addresses are the internal (non-public) IP addresses of three
    of the NFS volumes shared by ASE. The hardware address is the address
    of the NIC on the other server in the cluster that's connected to the
    crossover cat 5 cable. Both servers got this message at the same time,
    each complaining that the other guy was holding the IP address.
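To collect these duplicate-address complaints from a longer log, something
like the following Python sketch could be used. It assumes each message sits
on a single line in the actual syslog file (the lines above are wrapped by
the mailer), and the function name find_arp_conflicts is made up for this
illustration. The hardware address is treated as an opaque token, since the
post truncates it.

```python
import re
from collections import defaultdict

# Pattern for the kernel message quoted in the post. The hardware address
# is captured as an opaque non-whitespace token.
ARP_CONFLICT = re.compile(
    r"arp: local IP address (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"in use by hardware address (?P<hw>\S+)"
)


def find_arp_conflicts(lines):
    """Return {ip: set of hardware addresses reported as holding that ip}."""
    conflicts = defaultdict(set)
    for line in lines:
        match = ARP_CONFLICT.search(line)
        if match:
            conflicts[match.group("ip")].add(match.group("hw"))
    return dict(conflicts)
```

Passing an open syslog file (or any iterable of lines) from each server
should show whether both hosts report the same set of service addresses as
held by the other host's NIC.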

    The next set of messages on server A (the server that was hosting the
    services at the time) are a nasty series of domain I/O errors and domain
    panics on these three domains. Server B reported no further problems
    (in syslog anyway).

    Two of the three domains relocated (via ASE) to the other server and
    also magically seemed to reconstitute themselves (not via ASE) on the
    same server, i.e., the domains now appear in the 'df' output of both
    servers, something we have not seen before. The third domain, which is
    configured not to automatically fail over, reconstituted itself on the
    same server only, just fine.

    So basically we have these two "fake" AdvFS domains which ASE doesn't
    know about, on server A, as well as the two "real" domains which are on
    server B (our ASE is configured to automatically relocate services back
    to the preferred server when it becomes available again after failover).
    Furthermore, 'df' on server A reveals that the "fake" AdvFS domains are
    not consistent with the real ones in terms of space occupied; they are a
    little out, like they are no longer in sync.
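The 'df' comparison described above can be done by eye, but a small Python
sketch like the one below may help when there are many mounts. It assumes the
output of 'df -k' has been captured from each server as text, and that the
columns are filesystem, total, used, available, capacity, and mount point, in
that order, with no spaces in mount points. The function names are
hypothetical and not part of any Tru64 tool.

```python
def parse_df(text):
    """Map mount point -> (filesystem, used_kb) from `df -k`-style text."""
    entries = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or wrapped lines
        filesystem, used, mount = fields[0], fields[2], fields[5]
        if used.isdigit():
            entries[mount] = (filesystem, int(used))
    return entries


def shared_mount_differences(df_a, df_b):
    """Return {mount: (used_on_a, used_on_b)} for mounts seen on both hosts
    whose used space differs."""
    a, b = parse_df(df_a), parse_df(df_b)
    return {
        mount: (a[mount][1], b[mount][1])
        for mount in sorted(a.keys() & b.keys())
        if a[mount][1] != b[mount][1]
    }
```

Local filesystems such as / exist on both hosts and will usually differ too,
so the result is only meaningful for the mount points that belong to ASE
services.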

    Everything is working fine for the users; nobody has complained. The
    only reason we found out was that a backup job that was running at the
    time suddenly disappeared (its log file is on one of the domains in
    question, maybe that's why). But now we have this strangeness and I'm
    guessing that if I reboot the cluster, something bad might happen, like
    a domain not coming back.

    So I was going to try and use asemgr to fail the services back over to
    server A and hope that everything will magically sync itself. Anyone
    think that would be a mistake?

    Thanks
    John Speakman
    Memorial Sloan-Kettering Cancer Center, NYC

     

