SUMMARY: gridware application on 2-node cluster failed

From: Dr. Martin Körfer (koerfer_at_mpch-mainz.mpg.de)
Date: 03/07/05

  • Next message: Subhash P: "HSG80 & Tru cluster problem"
    Date: Mon, 07 Mar 2005 10:47:04 +0100
    To: tru64-unix-managers@ornl.gov
    
    

    The only answer I received to my request below:

    Did you try the mailing list?

    http://gridengine.sunsource.net/project/gridengine/maillist.html

     -Ron Chen

    Of course I did.
    There were some hints about the error message described, but none of them
    solved my problem.
    So I experimented and found something I would call a
    "workaround" that I can live with.

    - I used the standalone Tru64 UNIX AlphaServer "gridsrv" as qmaster
    - I started install_execd on the cluster member "server1...."
      => no sge_execd was running
    - so on "server1...." I ran:
      #.../rcsge -migrate
      => same error message: .......commd - qmaster not enrolled at commd-
    - then, the other way round, I went to "gridsrv" and ran there:
      #.../rcsge -migrate
      => the qmaster was started successfully on the standalone AlphaServer
         and, surprisingly, sge_execd was now running on "server1....",
         so I could use it as an execution host.
    This met my needs, so I stopped investigating the problem any further.
    Still, it would be interesting to know what caused it.
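
A minimal sketch of how the "which daemons are running" checks in this thread could be scripted. The function name `sge_daemon_status` is hypothetical (it is not part of Grid Engine); it only scans a process listing for the four daemon names mentioned in this thread. The `ps` options in the usage comment are an assumption and may need adjusting on Tru64.

```shell
# Report which Grid Engine daemons appear in a process listing read from stdin.
# The daemon names are the ones mentioned in this thread; the function name
# is hypothetical.
sge_daemon_status() {
    listing=$(cat)
    for d in sge_commd sge_qmaster sge_schedd sge_execd; do
        # -w matches the daemon name as a whole word, so a full path such as
        # /soft/gridware/sge/bin/tru64/sge_commd also counts.
        if printf '%s\n' "$listing" | grep -w "$d" >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: NOT running"
        fi
    done
}

# Possible use on a live host (ps options are an assumption):
#   ps -e -o comm | sge_daemon_status
```

Running this on the qmaster host and on each execution host gives a quick side-by-side view of which daemons came up after `install_execd` or `rcsge -migrate`.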

    Martin

    -------------------------------------------------------------------------------
    Original request:

    Hi managers,

    after a system crash and the successful restore of a two-node ES40 / HSG80
    cluster running Tru64 V5.1a PK6, all services restarted successfully except
    for "SGE 5.3-gridware".
    It came up with the error message:

    -unable to contact qmaster via "server1.mpch-mainz.mpg.de" commd - qmaster not
    enrolled at commd-

    where "server1.mpch-mainz.mpg.de" is one of the cluster nodes, used as "qmaster".

    -> no "sge_qmaster" was started
    -> no "sge_execd" was started

    Using a standalone AlphaServer (not in the cluster environment) as "qmaster",
    I succeeded -> all daemons were running.
    I then used "server1.mpch-mainz.mpg.de" as an execution host.
    Starting "install_execd" on it ran without error, but
    only "sge_commd" was running, !not! "sge_execd" (unlike on the other
    execution hosts, which are not in the cluster).
    On the second cluster member, "server2.mpch-mainz.mpg.de", I got the same
    result as on "server1".

    Trying a brand-new installation of the "SGE-5.3" software:

    #/soft/gridware/sge/inst_sge

    I finally ended up with the error message:

    Grid Engine qmaster and scheduler startup
    -----------------------------------------

    Starting qmaster and scheduler daemon. Please wait ...
       starting sge_qmaster
    starting program: /soft/gridware/sge/bin/tru64/sge_commd
    using service "sge_commd"
    bound to port 536
    Reading in complexes:
            Complex "host".
            Complex "queue".
    Reading in execution hosts.
    Reading in administrative hosts.
    Reading in parallel environments:
            PE "make".
    Reading in scheduler configuration
       starting sge_schedd

    error: getting configuration: unable to contact qmaster via "" commd - qmaster
    not enrolled at commd
    error: can't get configuration from qmaster -- backgrounding

    -> "sge_commd" and "sge_schedd" were started, but
       "sge_qmaster" and "sge_execd" were missing

    So I came to the conclusion that something has been missing on the cluster
    since the system restore (possibly a "socket" or something else).
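
One thing that can be checked cheaply in this situation: the startup log shows sge_commd taking its port from the service name "sge_commd" (port 536 in that log), so every host involved should resolve that service to the same port. The sketch below looks the entry up in a services-format file. The helper name `service_port` is hypothetical, and whether a mismatched or missing entry had anything to do with the failure described here is not known.

```shell
# Print the port/protocol registered for a service name in a
# services-format file. Hypothetical helper, not part of Grid Engine.
# Usage: service_port NAME [FILE]   (FILE defaults to /etc/services)
# Exit status is non-zero if the service is not listed.
service_port() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                   # drop comments
        $1 == name { print $2; found = 1 }   # first field is the service name
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/services}"
}

# Possible use, run on the qmaster host and on each execution host:
#   service_port sge_commd || echo "no sge_commd entry in /etc/services"
```

If the hosts print different values, or one of them prints nothing, the daemons on that host would be looking for commd on a different port than the one it is bound to.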

    Does anybody have an idea why "sge_qmaster" and "sge_execd" were not started
    on the cluster nodes, but do run on the standalone AlphaServer?
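
The error from the fresh installation names an empty host (`via ""`), which suggests checking which host the daemons believe is the qmaster. Grid Engine records the current qmaster host in an `act_qmaster` file under the cell's `common` directory; the sketch below compares its first line with a given host name. The helper name `qmaster_matches` is hypothetical, and the path in the usage comment assumes the installation root /soft/gridware/sge from this thread and a cell named "default", which is an assumption.

```shell
# Compare the host name recorded in an act_qmaster-style file with a given
# host name. Hypothetical helper, not part of Grid Engine.
# Usage: qmaster_matches FILE HOST
# Exit status 0 if the first line of FILE equals HOST (case-insensitive).
qmaster_matches() {
    recorded=$(head -n 1 "$1" | tr -d ' \t\r' | tr 'A-Z' 'a-z')
    wanted=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    echo "recorded qmaster: ${recorded:-<empty>}"
    [ -n "$recorded" ] && [ "$recorded" = "$wanted" ]
}

# Possible use (the cell name "default" is an assumption):
#   qmaster_matches /soft/gridware/sge/default/common/act_qmaster "$(hostname)"
```

An empty or unexpected entry would be consistent with the `via ""` message, but this is only a diagnostic idea, not a confirmed cause of the problem in this thread.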

    Right now (after a week of working on it) I am out of ideas.

    Any help would be appreciated

    Thanks in advance

    Martin Körfer

    --
    Dr.Martin Körfer
    Max-Planck-Institut für Chemie
    Elektronik
    J.J.Becherweg 27
    55128 Mainz
    Tel.: -49-6131-305488
    Fax:  -49-6131-305318