Response issues on GS1280, VMS 7.3-2
From: Lee (lytmah_at_telusplanet.net)
Date: 07/12/05
- Next message: AEF: "Re: Now the UK Has it's own date in the history of al quiada attacks"
- Previous message: Hoff Hoffman: "Re: Facility Prefix Registration (was: Re: A tale of two macro libraries LIB vs STARLET"
- Next in thread: Keith Parris: "Re: Response issues on GS1280, VMS 7.3-2"
- Reply: Keith Parris: "Re: Response issues on GS1280, VMS 7.3-2"
- Reply: Hein: "Re: Response issues on GS1280, VMS 7.3-2"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 12 Jul 2005 21:46:47 GMT
I'm experiencing response problem on a GS1280 cluster.
For those interested, here's the information.
Description of cluster:
Five-node Gigabit Ethernet VMScluster across three sites.
All nodes at VMS 7.3-2 patched up to Jun-2005.
Site 1: DS10 quorum node.
Site 2: Two production nodes on GS1280/16. Each 8 GB memory.
First node 4 CPU’s, second node 6 CPU’s.
Site 3: Two production nodes on GS1280/16. Each 8 GB memory.
First node 4 CPU’s, second node 6 CPU’s.
Third party SAN disk environment.
Using HBVS with 80 shadow members at each of sites 2 and 3.
Problem:
Early this year, the cluster was at VMS 7.3-2 running on four
individual ES45’s. Since migrating from the ES45 nodes to
four nodes on two GS1280’s in May-2005, interactive users have
been experiencing intermittent several-second periods of slow
response. The situation is occurring on all four production nodes.
Symptoms are more pronounced and wide-spread during peak
periods (mid-morning, noon). I myself notice occasional lags
of several seconds after entering a command in DCL.
Some observations during problem periods:
OBSV #1 Memory, CPU usage, critical parameter usage is decent.
Physical Memory Usage (pages): Total Free In Use Used
Modified
Main Memory (8.00GB) 1048576 557009 314891 46.8%
176676
Virtual I/O Cache (Kbytes): Total Free In Use Used
Cache Memory (MBytes) 125.00 1.56 123.43 98.7%
Granularity Hint Regions (pages): Total Free In Use Used
Released
Execlet code region 1536 312 1224 79.6%
0
Execlet data region 408 72 336 82.3%
0
S0/S1 Executive data region 12874 0 12874 100.0%
0
Resident image code region 1024 16 1008 98.4%
0
Slot Usage (slots): Total Free Resident Used
Swapped
Process Entry Slots 2000 955 1045 52.2%
0
Balance Set Slots 1998 955 1043 52.2%
0
Dynamic Memory Usage (bytes) Total Free In Use Used
Largest
Nonpaged 95.36 69.20 26.16 27.4%
56.26
Bus Addressable Memory 4.12 4.10 0.01 .2%
4.10
Paged 143.04 90.12 52.92 36.9%
85.11
Lock Manager Dyn Memory 62.01 23.08 38.93 62.7%
Buffer Object Usage (pages): In Use Peak
32-bit System Space Windows (S0/S1) 0 2
64-bit System Space Windows (S2) 100 102
Physical pages locked by buffer objects 100 102
Memory Reservations (pages): Group Reserved In Use Type
Total (0 bytes reserved) 0 0
Write Bitmap (WBM) Memory Summary
Local bitmap count: 0 Local bitmap memory usage (bytes)
0.00
Master bitmap count: 0 Master bitmap memory usage (bytes)
0.00
Swap File Usage (8KB pages): Index Free Used
Size
DISK$K_SYSTEM:[SYS0.SYSEXE]SWAPFILE.SYS 1 18744 .0%
18744
Paging File Usage (8KB pages): Index Free Used
Size
SS635:[000000]PAGEFILE3_K.SYS;1 252 187496 .0%
187496
SS642:[000000]PAGEFILE2_K.SYS;1 253 187496 .0%
187496
DISK$K_SYSTEM:[SYS0.SYSEXE]PAGEFILE.SYS 254 365192 .0%
365192
Total size of all paging files:
740184
Total committed paging file usage: 56%
414507
C: 1048576 pages of memory. 54310 pages (5.1%) permanently allocated
to VMS.
OBSV #2 No resource hogs have been found on any of the nodes.
OBSV #3 Sequential snapshots of the nodes show many processes in/out
of MUTEX. The processes in MUTEX range widely, from OPCOM
to production users. These processes slide in and out of
MUTEX so quickly that there is inadequate time to determine
the reason for the MUTEX state.
Display 1:
OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:21.22 Uptime 26 04:27:38
Pid Process Name State Pri I/O CPU Page flts
Pages
OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:21.24 Uptime 39 21:37:28
Pid Process Name State Pri I/O CPU Page flts
Pages
23C00817 SECURITY_SERVER CUR 0 16 184409 0 07:50:51.19 433
513
23C00934 TCPIP$TNS43 MUTEX 4 2338026 0 00:02:58.71 448114
211
23C721AF _TNA1541: MUTEX 4 2739 0 00:00:03.59 1890
455
23C4FBC6 _TNA1555: MUTEX 4 544 0 00:00:02.28 630
813
23C32431 _TNA1560: MUTEX 4 81 0 00:00:02.44 128
118
23C0A433 _TNA788: MUTEX 4 48874 0 00:00:23.25 45936
543
23C3BC36 RA23C2BC91 MUTEX 4 133 0 00:00:00.05 328
544 S
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
146
OpenVMS V7.3-2 on node C 12-JUL-2005 14:29:21.52 Uptime 40 22:08:12
Pid Process Name State Pri I/O CPU Page flts
Pages
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
8079
23A501C5 _TNA3820: MUTEX 6 87 0 00:00:01.42 131
118
23A069C7 _TNA3821: MUTEX 6 83 0 00:00:02.11 128
118
23A011C8 _TNA3822: MUTEX 6 139 0 00:00:02.32 146
136
23A079D5 RA_18 MUTEX 16 0 0 00:00:00.00 30
41 S
23B1B32A _TNA2367: MUTEX 6 108033 0 00:00:47.24 61677
433
23A2441F DALE MUTEX 5 3184258 0 02:13:07.11 43266
5397
23A084D5 GR_23B1C4D2 MUTEX 6 4592 0 00:00:03.14 107295
63 S
23BABE8D 0323 CUR 2 4 214154 0 00:01:50.15 654
641
23A35710 _TNA3806: MUTEX 6 1141 0 00:00:03.48 1248
491
OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:21.74 Uptime 38 21:10:47
Pid Process Name State Pri I/O CPU Page flts
Pages
OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:21.91 Uptime 47 00:19:52
Pid Process Name State Pri I/O CPU Page flts
Pages
23227194 _TNA4239: CUR 5 16 200856 0 00:00:25.24 17587
141
2326535D _TNA5345: MUTEX 8 95499 0 00:00:33.43 8715
944
Display 2: Several seconds later.
OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:25.14 Uptime 26 04:27:42
Pid Process Name State Pri I/O CPU Page flts
Pages
OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:25.16 Uptime 39 21:37:32
Pid Process Name State Pri I/O CPU Page flts
Pages
23C183A6 _TNA1553: MUTEX 6 920 0 00:00:04.46 1325
382
23C32431 _TNA1560: CUR 0 16 85 0 00:00:03.91 128
118
23C0A433 _TNA788: MUTEX 4 48936 0 00:00:23.30 45956
210
23C3BC36 RA23C2BC91 RWSCS 6 348 0 00:00:00.24 505
767 S
23C4F43A TCPIP$FTPC06025 MUTEX 5 117 0 00:00:00.22 238
256 N
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
146
OpenVMS V7.3-2 on node C 12-JUL-2005 14:29:25.37 Uptime 40 22:08:16
Pid Process Name State Pri I/O CPU Page flts
Pages
23A008F6 MONITOR_GATEWAY CUR 5 1 29282337 0 00:52:17.41 27612815
118
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
8079
23A2441F DALE CUR 5 5 3184591 0 02:13:08.02 43266
5397
23BABE8D 0323 CUR 5 5 215096 0 00:01:53.68 654
641
OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:25.56 Uptime 38 21:10:50
Pid Process Name State Pri I/O CPU Page flts
Pages
23F8CA29 _TNA866: RWSCS 4 8664 0 00:00:07.98 4622
1913
OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:25.83 Uptime 47 00:19:56
Pid Process Name State Pri I/O CPU Page flts
Pages
Display 3: Several seconds later.
OpenVMS V7.3-2 on node A 12-JUL-2005 14:29:28.25 Uptime 26 04:27:45
Pid Process Name State Pri I/O CPU Page flts
Pages
OpenVMS V7.3-2 on node B 12-JUL-2005 14:29:28.26 Uptime 39 21:37:35
Pid Process Name State Pri I/O CPU Page flts
Pages
23D7E642 _TNA4059: SUSP 0 3636 0 00:03:51.03 4264
146
OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:28.39 Uptime 40 22:08:19
Pid Process Name State Pri I/O CPU Page flts
Pages
23A009B2 Epoch SUSP 2101415952 0 14:16:04.80 7929
8079
23A2441F DALE CUR 2 4 3185385 0 02:13:09.77 43266
5397
23BABE8D 0323 CUR 4 5 215645 0 00:01:56.57 654
641
OpenVMS V7.3-2 on node D 12-JUL-2005 14:29:28.51 Uptime 38 21:10:53
Pid Process Name State Pri I/O CPU Page flts
Pages
OpenVMS V7.3-2 on node E 12-JUL-2005 14:29:28.61 Uptime 47 00:19:58
Pid Process Name State Pri I/O CPU Page flts
Pages
OBSV #4 HP has identified one main problem as being in logical name
translation.
Here’s the status from the four nodes (from MONITOR IO).
CUR AVE MIN MAX
Log Name Translation Rate 198.66 906.97 0.00 9845.33
Log Name Translation Rate 3902.00 3896.64 0.00 15286.00
Log Name Translation Rate 2077.00 1341.27 0.00 13067.33
Log Name Translation Rate 1690.66 621.39 0.00 3901.33
On the ES45’s, I could execute a procedure containing 1000 logical
name translations in a split second. On the GS1280 nodes, the
same procedure requires from several to 10 seconds.
OBSV #5 I ran Autogen with feedback and a couple of items stood out.
…
The feedback data is based on 931 hours of up time.
…
MSCP_BUFFER parameter information:
Feedback information.
Old value was 1300, New value is 1300
MSCP server I/O rate: 367 I/Os per 10 sec.
I/Os that waited for buffer space: 1564
I/Os that fragmented into multiple transfers: 3276
** WARNING ** - The value of MIN_IMGIOCNT specified
(228) is less than the minimum permissible value of 332.
Please review the MODPARAMS definitions of MIN_IMGIOCNT.
MIN_IMGIOCNT will be set to 332.
Additional nodes:
1. System parameter values on the GS1280 nodes are at the
same or higher level than on the ES45’s.
2. Our disk defragger has not been running for the last
6 months due to it breaking on SAN disks with
unexpected values coming from DVE
(logical/total/expansion).
The vendor has fixed the problem, but I
haven’t implemented defragging yet.
However, I have not noticed any disk queue
length problems.
3. The number of processes running on the cluster
has not increased appreciably since the migration
to the GS1280’s.
4. HP’s T4 has been installed on the four production nodes
but not yet running.
5. HP’s System Health Check has been installed on the four
production nodes but not yet running.
Has anyone else experienced similar response issues?
Any ideas?
- Next message: AEF: "Re: Now the UK Has it's own date in the history of al quiada attacks"
- Previous message: Hoff Hoffman: "Re: Facility Prefix Registration (was: Re: A tale of two macro libraries LIB vs STARLET"
- Next in thread: Keith Parris: "Re: Response issues on GS1280, VMS 7.3-2"
- Reply: Keith Parris: "Re: Response issues on GS1280, VMS 7.3-2"
- Reply: Hein: "Re: Response issues on GS1280, VMS 7.3-2"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|