Re: Response issues on GS1280, VMS 7.3-2

From: Lee (lytmah_at_telusplanet.net)
Date: 07/15/05


Date: Fri, 15 Jul 2005 17:53:34 GMT

Do you have LAVC$FAILURE_ANALYSIS in place? This would help you
determine if transient network problems are a contributing factor.

        We've ruled out network as the cause of our problem.

Do you see a lot of disk mount verifications?

        No disk mount verifications of any kind in the last 5 days.

Is this just any arbitrary DCL command, or something dealing with
specific file(s) or disk(s)? If it's any arbitrary DCL command, then
maybe a CPU shortage is involved, including things like saturation of
the primary CPU in interrupt state ($MONITOR MODES/ALL could help check
for that).

        Users run approx. 1,000 commands in the same format when
        they log into the cluster:
                ...
                $ DEF/NOLOG/JOB FILE1 FILE1.DAT
                $ DEF/NOLOG/JOB FILE2 FILE2.DAT
                $ DEF/NOLOG/JOB FILE3 FILE3.DAT
                $ DEF/NOLOG/JOB FILE4 FILE4.DAT
        …

HP suspects logical name translation to be our problem.
Specifically one of our system logicals called XXXXXX.
Here's trace results of a few seconds from SDA.

Logical Name Trace Information from node L:
        Count Logical Name
         2150 XXXXXX
          294 SYS$SYSROOT
          166 SYS$SHARE
          128 SYS$COMMON
                …
Logical Name Trace Information from node M:
        Count Logical Name
         3561 XXXXXX
          158 SYS$SYSROOT
          113 SYS$SHARE
           69 SYS$COMMON
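
To put those counts in proportion, here is a rough sketch (Python, not
anything run on the cluster) that parses the trace listing above and
works out what share of the listed translations XXXXXX accounts for on
each node. The function names parse_trace and share are made up for
illustration, and the node L listing is truncated, so the figures cover
only the names shown, not every translation in the trace.

```python
import re

# Trace listing as posted; node L was truncated in the original,
# so only the names actually shown are included here.
TRACE = """\
Logical Name Trace Information from node L:
        Count Logical Name
         2150 XXXXXX
          294 SYS$SYSROOT
          166 SYS$SHARE
          128 SYS$COMMON
Logical Name Trace Information from node M:
        Count Logical Name
         3561 XXXXXX
          158 SYS$SYSROOT
          113 SYS$SHARE
           69 SYS$COMMON
"""


def parse_trace(text):
    """Return {node: {logical_name: count}} from the listing text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        header = re.match(
            r"Logical Name Trace Information from node (\S+):", line.strip()
        )
        if header:
            current = header.group(1)
            nodes[current] = {}
            continue
        # Data rows are "<count> <name>"; the column header row has no digits.
        row = re.match(r"\s*(\d+)\s+(\S+)\s*$", line)
        if row and current is not None:
            nodes[current][row.group(2)] = int(row.group(1))
    return nodes


def share(counts, name):
    """Fraction of the listed translations that belong to `name`."""
    total = sum(counts.values())
    return counts.get(name, 0) / total if total else 0.0


if __name__ == "__main__":
    for node, counts in parse_trace(TRACE).items():
        pct = share(counts, "XXXXXX")
        print(f"node {node}: XXXXXX is {pct:.1%} of the listed translations")
```

Among the names listed, XXXXXX dominates on both nodes, which fits with
HP's suspicion.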

During the interactive degradation, CPU usage is very low,
the disk queue length is less than 1,…

I've traced LNM and the high RECLAIM count is not specific to
any one program or any process.
This logical is used by all the FIO routines in our in-house
application programs.
90% of the applications running on the four fairly homogeneous
cluster nodes are in-house.

The strange thing is, most of the programs have not changed
since the migration to GS1280 in May/2005.
The same application programs and FIO routines were compiled
a few years ago.
On the ES45's, the programs ran with no problem.

After the first node was migrated to a GS1280 hard partition,
users have experienced degraded response on it.

When we had four ES45's, we could roll out one node for
SW/HW maintenance with no response problem.
Now, when we take one GS1280 node out, response is embarrassing.

I've specified System Health Check and T4 to run daily.

Keith Parris wrote:
> Lee wrote:
>
>> Five-node Gigabit Ethernet VMScluster across three sites.
>
...


