Re: vmstat analysis - slow running batch job



Praveen,

Thank you for your reply. I have attached the first three hours of the
lparstat analysis for the run above and, as you can see, the CPU is not
overloaded. However, the %wait column rises sharply while the job runs,
from under 4% to between roughly 30% and 41% from 12:15 onwards.

Time   %user  %sys  %wait  %idle  physc  %entc  lbusy  vcsw  phint
12:00   13.4   1.5    2.2   82.9   0.97   16.2    7.6  1880     34
12:05   10.9   1.8    2.8   84.5   0.85   14.1    6.6  3000     31
12:10   10.6   1.9    3.9   83.6   0.83   13.9    6.6  2930     23
12:15    7.5     5     40   47.5   0.86   14.4    6.7  3340     39
12:20      2   6.5   39.6   51.8   0.62   10.3    4.8  3639     39
12:25    1.9   8.8   41.1   48.3   0.76   12.7    5.9  5115     43
12:30    1.9  10.5   41.2   46.4   0.88   14.7      7  5627     55
12:35    1.9   9.8   39.5   48.8   0.82   13.7    6.3  4627     50
12:40    1.7  10.9   38.3   49.1   0.88   14.7    6.9  4874     49
12:45    1.7     8   30.8   59.4   0.68   11.4    5.3  4186     34
12:50    1.5  10.4   41.5   46.6   0.84   13.9    6.4  4708     53
12:55      2   8.6   29.5   59.8   0.75   12.4    5.8  4531     45
13:00    1.4  10.1   38.4   50.1   0.81   13.5    6.3  4924     40
13:05    2.1   9.3   32.9   55.7   0.81   13.5    6.3  5627     49
13:10    1.5   8.7   33.5   56.2   0.73   12.1    5.7  5042     48
13:15    1.8  10.4   37.1   50.6   0.86   14.3    6.7  4886     43
13:20    1.9   7.8   34.1   56.2   0.69   11.5    5.3  5011     46
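
To put a number on the jump, here is a rough Python sketch that averages
%wait before and after the point where it rises. The 12:15 cut-off is my
assumption from reading the table, not a logged job start time, and the
names (WAIT_BY_TIME, JOB_START, split_wait) are only for illustration.

```python
# (time, %wait) pairs copied by hand from the lparstat table above.
WAIT_BY_TIME = [
    ("12:00", 2.2), ("12:05", 2.8), ("12:10", 3.9),
    ("12:15", 40.0), ("12:20", 39.6), ("12:25", 41.1),
    ("12:30", 41.2), ("12:35", 39.5), ("12:40", 38.3),
    ("12:45", 30.8), ("12:50", 41.5), ("12:55", 29.5),
    ("13:00", 38.4), ("13:05", 32.9), ("13:10", 33.5),
    ("13:15", 37.1), ("13:20", 34.1),
]

# Assumed start of the batch job, inferred from where %wait jumps.
JOB_START = "12:15"


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def split_wait(rows, start):
    """Split %wait samples into those before `start` and those from `start` on.

    Zero-padded HH:MM strings compare correctly as plain strings.
    """
    before = [wait for time, wait in rows if time < start]
    during = [wait for time, wait in rows if time >= start]
    return before, during


before, during = split_wait(WAIT_BY_TIME, JOB_START)
print(f"mean %wait before {JOB_START}: {mean(before):.1f}")
print(f"mean %wait from {JOB_START}:   {mean(during):.1f}")
```

On the rows shown this gives roughly 3% before 12:15 and roughly 37% from
12:15 onwards.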


I have also attached the vmstat output for the previous run:

      kthr      memory             page             faults              cpu
Time   r  b     avm    fre re pi po   fr   sr cy  in   sy   cs us sy id wa   pc   ec
16:30  1  1 3476031   5955  0  0  0  475  662  0 479 3467 2822  9  2 86  3 0.76 12.6
16:35  1  1 3461033 416829  0  1  1  437  795  0 526 2249 1900 12  2 79  7  0.9 15.1
16:40  1  1 3463570 150256  0  0  0    0    0  0 557 1103 1370  3  1 88  8 0.27  4.6
16:45  1  1 3463401   5784  0  0  0  713 1402  0 320  952 1037  1  1 90  8 0.17  2.8
16:50  1  1 3463586   5805  0  5 13 1303 2281  0 487  836 1846  1  1 90  7 0.22  3.7
16:55  1  1 3463412   5945  0  6 16 1393 2355  0 472 1471 1701  2  2 89  7 0.24    4
17:00  1  1 3463553   5915  0 52 16 1217 2560  0 378 1251 1132  1  1 89  9 0.18  3.1
17:05  1  1 3463414   6198  0 18  6 1308 1968  0 455 1229 1659  2  1 89  8 0.22  3.7
17:10  1  1 3463515   5823  0  5  7 1370 2554  0 414 1630 1403  2  1 90  7 0.22  3.7
17:15  1  1 3463427   6115  0  7 27 1527 2755  0 520 1659 1804  2  2 89  7 0.26  4.4
17:20  1  1 3463467   6307  0 13 27 1315 2279  0 380  824 1131  1  1 90  7 0.18  3.1
17:25  1  1 3461800   6426  0 24 64 1427 3145  0 550 1184 1840  2  2 89  7 0.24    4
17:30  1  1 3463511   5827  0 23 49 1410 2597  0 472 1137 1478  1  1 90  7 0.21  3.6
17:35  1  1 3463606   5829  0 24 24 1421 3094  0 485  891 1638  1  1 90  7 0.22  3.6
17:40  1  1 3461830   5956  0 30 27 1390 2839  0 462 1369 1486  2  1 90  7 0.23  3.8
17:45  1  1 3462928   6208  0 35 33 3709 7917  0 416  879 1509  5  3 86  6 0.53  8.8
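
For comparison with the lparstat figures, a small Python sketch that pulls
the wa column out of the vmstat rows above and averages it. The column
position (wa as the 18th field) is taken from the header line; the variable
and function names are made up for this post.

```python
# vmstat rows for the previous run, pasted from the output above.
VMSTAT_ROWS = """\
16:30 1 1 3476031 5955 0 0 0 475 662 0 479 3467 2822 9 2 86 3 0.76 12.6
16:35 1 1 3461033 416829 0 1 1 437 795 0 526 2249 1900 12 2 79 7 0.9 15.1
16:40 1 1 3463570 150256 0 0 0 0 0 0 557 1103 1370 3 1 88 8 0.27 4.6
16:45 1 1 3463401 5784 0 0 0 713 1402 0 320 952 1037 1 1 90 8 0.17 2.8
16:50 1 1 3463586 5805 0 5 13 1303 2281 0 487 836 1846 1 1 90 7 0.22 3.7
16:55 1 1 3463412 5945 0 6 16 1393 2355 0 472 1471 1701 2 2 89 7 0.24 4
17:00 1 1 3463553 5915 0 52 16 1217 2560 0 378 1251 1132 1 1 89 9 0.18 3.1
17:05 1 1 3463414 6198 0 18 6 1308 1968 0 455 1229 1659 2 1 89 8 0.22 3.7
17:10 1 1 3463515 5823 0 5 7 1370 2554 0 414 1630 1403 2 1 90 7 0.22 3.7
17:15 1 1 3463427 6115 0 7 27 1527 2755 0 520 1659 1804 2 2 89 7 0.26 4.4
17:20 1 1 3463467 6307 0 13 27 1315 2279 0 380 824 1131 1 1 90 7 0.18 3.1
17:25 1 1 3461800 6426 0 24 64 1427 3145 0 550 1184 1840 2 2 89 7 0.24 4
17:30 1 1 3463511 5827 0 23 49 1410 2597 0 472 1137 1478 1 1 90 7 0.21 3.6
17:35 1 1 3463606 5829 0 24 24 1421 3094 0 485 891 1638 1 1 90 7 0.22 3.6
17:40 1 1 3461830 5956 0 30 27 1390 2839 0 462 1369 1486 2 1 90 7 0.23 3.8
17:45 1 1 3462928 6208 0 35 33 3709 7917 0 416 879 1509 5 3 86 6 0.53 8.8
"""

# "sy" appears twice in the vmstat header (system calls, then %sys), so the
# wa column is addressed by position rather than by name.
WA_FIELD = 17  # zero-based index of wa, counting Time as field 0


def wa_samples(text):
    """Return the wa value from each non-empty row as an int."""
    return [
        int(line.split()[WA_FIELD])
        for line in text.splitlines()
        if line.strip()
    ]


wa_values = wa_samples(VMSTAT_ROWS)
print(f"samples: {len(wa_values)}")
print(f"mean wa: {sum(wa_values) / len(wa_values):.1f}")
print(f"peak wa: {max(wa_values)}")
```

On these rows wa averages 7 with a peak of 9, well below the %wait values
in the lparstat output for the slow run.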

We would expect the vmstat profile of the most recent run to be similar
to the one shown above, but it is significantly different. What we are
trying to understand is why this job took seven hours at the weekend
when previous runs have always completed in just over an hour. We think
the cause is I/O related because of the high %wait values in the
lparstat output, and because the database ADDM report indicates that
User I/O was consuming significant database time.

As I mentioned above, there have been no hardware, configuration,
database or code changes between runs. The tables concerned do grow
month on month, but not significantly, and we collect statistics
regularly for each run. The files are on JFS2 filesystems in logical
volumes and are RAID 5 striped. I do not know the maxperm and minperm
settings but, again, these have not changed between runs.

Thanks,

Tony
