Response issues on GS1280, VMS 7.3-2

From: Lee (lytmah_at_telusplanet.net)
Date: 07/12/05


Date: Tue, 12 Jul 2005 21:46:47 GMT

I'm experiencing response problem on a GS1280 cluster.
For those interested, here's the information.

Description of cluster:
Five-node Gigabit Ethernet VMScluster across three sites.
All nodes at VMS 7.3-2 patched up to Jun-2005.
Site 1: DS10 quorum node.
Site 2: Two production nodes on GS1280/16. Each 8 GB memory.
        First node 4 CPU’s, second node 6 CPU’s.
Site 3: Two production nodes on GS1280/16. Each 8 GB memory.
        First node 4 CPU’s, second node 6 CPU’s.
Third party SAN disk environment.
Using HBVS with 80 shadow members at each of sites 2 and 3.

Problem:
        Early this year, the cluster was at VMS 7.3-2 running on four
        individual ES45’s. Since migrating from the ES45 nodes to
        four nodes on two GS1280’s in May-2005, interactive users have
        been experiencing intermittent several-second periods of slow
response. The situation is occurring on all four production nodes.
Symptoms are more pronounced and wide-spread during peak
periods (mid-morning, noon). I myself notice occasional lags
of several seconds after entering a command in DCL.

Some observations during problem periods:

OBSV #1 Memory, CPU usage, critical parameter usage is decent.

Physical Memory Usage (pages): Total Free In Use Used
Modified
   Main Memory (8.00GB) 1048576 557009 314891 46.8%
    176676
Virtual I/O Cache (Kbytes): Total Free In Use Used
   Cache Memory (MBytes) 125.00 1.56 123.43 98.7%
Granularity Hint Regions (pages): Total Free In Use Used
Released
   Execlet code region 1536 312 1224 79.6%
         0
   Execlet data region 408 72 336 82.3%
         0
   S0/S1 Executive data region 12874 0 12874 100.0%
         0
   Resident image code region 1024 16 1008 98.4%
         0
Slot Usage (slots): Total Free Resident Used
  Swapped
   Process Entry Slots 2000 955 1045 52.2%
         0
   Balance Set Slots 1998 955 1043 52.2%
         0
Dynamic Memory Usage (bytes) Total Free In Use Used
  Largest
   Nonpaged 95.36 69.20 26.16 27.4%
     56.26
   Bus Addressable Memory 4.12 4.10 0.01 .2%
      4.10
   Paged 143.04 90.12 52.92 36.9%
     85.11
   Lock Manager Dyn Memory 62.01 23.08 38.93 62.7%
Buffer Object Usage (pages): In Use Peak
   32-bit System Space Windows (S0/S1) 0 2
   64-bit System Space Windows (S2) 100 102
   Physical pages locked by buffer objects 100 102
Memory Reservations (pages): Group Reserved In Use Type
   Total (0 bytes reserved) 0 0
Write Bitmap (WBM) Memory Summary
   Local bitmap count: 0 Local bitmap memory usage (bytes)
  0.00
   Master bitmap count: 0 Master bitmap memory usage (bytes)
  0.00
Swap File Usage (8KB pages): Index Free Used
     Size
   DISK$K_SYSTEM:[SYS0.SYSEXE]SWAPFILE.SYS 1 18744 .0%
     18744
Paging File Usage (8KB pages): Index Free Used
     Size
   SS635:[000000]PAGEFILE3_K.SYS;1 252 187496 .0%
    187496
   SS642:[000000]PAGEFILE2_K.SYS;1 253 187496 .0%
    187496
   DISK$K_SYSTEM:[SYS0.SYSEXE]PAGEFILE.SYS 254 365192 .0%
    365192
   Total size of all paging files:
    740184
   Total committed paging file usage: 56%
    414507
C: 1048576 pages of memory. 54310 pages (5.1%) permanently allocated
to VMS.

OBSV #2 No resource hogs have been found on any of the nodes.

OBSV #3 Sequential snapshots of the nodes show many processes in/out
        of MUTEX. The processes in MUTEX range widely, from OPCOM
        to production users. These processes slide in and out of
        MUTEX so quickly that there is inadequate time to determine
        the reason for the MUTEX state.

Display 1:

OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:21.22 Uptime 26 04:27:38
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:21.24 Uptime 39 21:37:28
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23C00817 SECURITY_SERVER CUR 0 16 184409 0 07:50:51.19 433
   513
23C00934 TCPIP$TNS43 MUTEX 4 2338026 0 00:02:58.71 448114
   211
23C721AF _TNA1541: MUTEX 4 2739 0 00:00:03.59 1890
   455
23C4FBC6 _TNA1555: MUTEX 4 544 0 00:00:02.28 630
   813
23C32431 _TNA1560: MUTEX 4 81 0 00:00:02.44 128
   118
23C0A433 _TNA788: MUTEX 4 48874 0 00:00:23.25 45936
   543
23C3BC36 RA23C2BC91 MUTEX 4 133 0 00:00:00.05 328
   544 S
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
   146

OpenVMS V7.3-2 on node C 12-JUL-2005 14:29:21.52 Uptime 40 22:08:12
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
  8079
23A501C5 _TNA3820: MUTEX 6 87 0 00:00:01.42 131
   118
23A069C7 _TNA3821: MUTEX 6 83 0 00:00:02.11 128
   118
23A011C8 _TNA3822: MUTEX 6 139 0 00:00:02.32 146
   136
23A079D5 RA_18 MUTEX 16 0 0 00:00:00.00 30
    41 S
23B1B32A _TNA2367: MUTEX 6 108033 0 00:00:47.24 61677
   433
23A2441F DALE MUTEX 5 3184258 0 02:13:07.11 43266
  5397
23A084D5 GR_23B1C4D2 MUTEX 6 4592 0 00:00:03.14 107295
    63 S
23BABE8D 0323 CUR 2 4 214154 0 00:01:50.15 654
   641
23A35710 _TNA3806: MUTEX 6 1141 0 00:00:03.48 1248
   491

OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:21.74 Uptime 38 21:10:47
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:21.91 Uptime 47 00:19:52
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23227194 _TNA4239: CUR 5 16 200856 0 00:00:25.24 17587
   141
2326535D _TNA5345: MUTEX 8 95499 0 00:00:33.43 8715
   944

Display 2: Several seconds later.

OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:25.14 Uptime 26 04:27:42
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:25.16 Uptime 39 21:37:32
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23C183A6 _TNA1553: MUTEX 6 920 0 00:00:04.46 1325
   382
23C32431 _TNA1560: CUR 0 16 85 0 00:00:03.91 128
   118
23C0A433 _TNA788: MUTEX 4 48936 0 00:00:23.30 45956
   210
23C3BC36 RA23C2BC91 RWSCS 6 348 0 00:00:00.24 505
   767 S
23C4F43A TCPIP$FTPC06025 MUTEX 5 117 0 00:00:00.22 238
   256 N
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
   146

OpenVMS V7.3-2 on node C 12-JUL-2005 14:29:25.37 Uptime 40 22:08:16
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23A008F6 MONITOR_GATEWAY CUR 5 1 29282337 0 00:52:17.41 27612815
   118
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
  8079
23A2441F DALE CUR 5 5 3184591 0 02:13:08.02 43266
  5397
23BABE8D 0323 CUR 5 5 215096 0 00:01:53.68 654
   641

OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:25.56 Uptime 38 21:10:50
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23F8CA29 _TNA866: RWSCS 4 8664 0 00:00:07.98 4622
  1913

OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:25.83 Uptime 47 00:19:56
   Pid Process Name State Pri I/O CPU Page flts
  Pages

Display 3: Several seconds later.

OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:28.25 Uptime 26 04:27:45
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:28.26 Uptime 39 21:37:35
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
   146

OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:28.39 Uptime 40 22:08:19
   Pid Process Name State Pri I/O CPU Page flts
  Pages
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
  8079
23A2441F DALE CUR 2 4 3185385 0 02:13:09.77 43266
  5397
23BABE8D 0323 CUR 4 5 215645 0 00:01:56.57 654
   641

OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:28.51 Uptime 38 21:10:53
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:28.61 Uptime 47 00:19:58
   Pid Process Name State Pri I/O CPU Page flts
  Pages

OBSV #4 HP has identified one main problem as being in logical name
translation.
        Here’s the status from the four nodes (from MONITOR IO).
                                    CUR AVE MIN MAX
Log Name Translation Rate 198.66 906.97 0.00 9845.33
Log Name Translation Rate 3902.00 3896.64 0.00 15286.00
Log Name Translation Rate 2077.00 1341.27 0.00 13067.33
Log Name Translation Rate 1690.66 621.39 0.00 3901.33

On the ES45’s, I could execute a procedure containing 1000 logical
name translations in a split second. On the GS1280 nodes, the
same procedure requires from several to 10 seconds.

OBSV #5 I ran Autogen with feedback and a couple of items stood out.
        …
        The feedback data is based on 931 hours of up time.
        …

MSCP_BUFFER parameter information:
         Feedback information.
            Old value was 1300, New value is 1300
            MSCP server I/O rate: 367 I/Os per 10 sec.
            I/Os that waited for buffer space: 1564
            I/Os that fragmented into multiple transfers: 3276

** WARNING ** - The value of MIN_IMGIOCNT specified
         (228) is less than the minimum permissible value of 332.
         Please review the MODPARAMS definitions of MIN_IMGIOCNT.
             MIN_IMGIOCNT will be set to 332.

Additional nodes:

        1. System parameter values on the GS1280 nodes are at the
                same or higher level than on the ES45’s.
        2. Our disk defragger has not been running for the last
                6 months due to it breaking on SAN disks with
                unexpected values coming from DVE
                (logical/total/expansion).
                The vendor has fixed the problem, but I
                haven’t implemented defragging yet.
                However, I have not noticed any disk queue
                length problems.
        3. The number of processes running on the cluster
                has not increased appreciably since the migration
                to the GS1280’s.
        4. HP’s T4 has been installed on the four production nodes
                but not yet running.
        5. HP’s System Health Check has been installed on the four
                production nodes but not yet running.

Has anyone else experienced similar response issues?
Any ideas?



Relevant Pages