Re: Problems with SCO 5.0.6 and Informix 7.12 (long)
From: Bela Lubkin (belal_at_sco.com)
Date: 01/28/05
- In reply to: Rob: "Problems with SCO 5.0.6 and Informix 7.12 (long)"
Date: 28 Jan 2005 12:59:12 -0500
Roberto Zini wrote:
> I'd like to share with the groups a problem a customer of ours is facing
> on a 5.0.6 box with an old Informix 7.12 database (operating on a RAW
> partition).
> SCO OS 5.0.6 + SMP + RS506A + OSS648A + OSS651A + OSS644B + OSS650A.
>
> The box is an IBM Server xSeries 255 (8685-C1X) with an IBM CONTROLLER
> ServeRAID-6M (drivers v 7.10); the disks are arranged in a RAID-5 array
> (HW driven).
>
> This box is a dual Xeon 3.2 GHz box with 4GB of RAM.
>
> This server hosts approx 350 users with an account package written in 4GL.
>
> Approx once or twice a day, the database server (namely, a couple of
> instances of "oninit") ends up taking approx 99% of the CPU, forcing
> the administrator to manually "kill" (-9) these processes and issue an
> "oninit" to get the engine working again.
>
> One way to trigger the hang is the parallel execution of a HUGE query
> (which handles several MILLION records); as soon as the operation
> starts, "sar -U" reports a high %wio value but the system does not slow
> down too much. Approx 30 to 40 minutes later, the HD lights (which were
> pretty "wild" during the query) return to a more normal flashing and the
> above oninit instances are frozen.
> For those of you who like reading on, here are some excerpts from the
> "mpsar -U" command (along with my comments); the massive query started
> at approx 15:00 and the oninit processes hung at approx 15:55.
>
> 08:00:01   %usr   %sys   %wio  %idle   (-u)
> 14:40:00      9      5      9     77
> 14:45:00     13      6     50     31
> 14:50:00     16      6     67     11
> 14:55:00     26      8     61      5
> 15:00:00     26      8     60      6
> 15:05:00     26      8     62      5
> 15:10:00     30      7     60      2
> 15:15:00     20      7     70      4
> 15:20:00     20      7     70      3
> 15:25:00     18      7     71      3
> 15:30:00     20      8     68      4
> 15:35:00     18      7     71      4
> 15:40:00     68      3     28      1
> 15:45:00     99      1      0      0
> 15:50:00     99      1      0      0
> 15:55:00     98      2      0      0
> 16:00:00     98      1      0      1
> The oninit processes got complete control over the CPU; the high %usr
> value makes me suspect a looping (bug) condition in the engine.
Looks like that.
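
To pin down when the spin began, the (-u) samples above can be scanned for the first interval that is almost pure user time with no I/O wait. This is only a rough sketch: the function name and the thresholds (95% user, at most 1% wio) are assumptions picked for illustration, not anything sar itself defines.

```python
# Samples transcribed from the (-u) excerpt: (time, %usr, %sys, %wio, %idle)
SAMPLES = [
    ("14:40:00", 9, 5, 9, 77),
    ("14:45:00", 13, 6, 50, 31),
    ("14:50:00", 16, 6, 67, 11),
    ("14:55:00", 26, 8, 61, 5),
    ("15:00:00", 26, 8, 60, 6),
    ("15:05:00", 26, 8, 62, 5),
    ("15:10:00", 30, 7, 60, 2),
    ("15:15:00", 20, 7, 70, 4),
    ("15:20:00", 20, 7, 70, 3),
    ("15:25:00", 18, 7, 71, 3),
    ("15:30:00", 20, 8, 68, 4),
    ("15:35:00", 18, 7, 71, 4),
    ("15:40:00", 68, 3, 28, 1),
    ("15:45:00", 99, 1, 0, 0),
    ("15:50:00", 99, 1, 0, 0),
    ("15:55:00", 98, 2, 0, 0),
    ("16:00:00", 98, 1, 0, 1),
]


def find_spin_start(samples, usr_min=95, wio_max=1):
    """Return the time of the first sample that looks like a user-mode spin.

    usr_min and wio_max are illustrative thresholds (assumptions).
    Returns None if no sample qualifies.
    """
    for time, usr, _sys, wio, _idle in samples:
        if usr >= usr_min and wio <= wio_max:
            return time
    return None


print(find_spin_start(SAMPLES))
```

The 15:40 sample (68% usr, 28% wio) is a mix of the two states, so the spin most likely started somewhere inside that five-minute interval.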
> 08:00:01   msg/s    sema/s   (-m)
> 14:40:00    0.00   2563.73
> 14:45:00    0.00   4192.57
> 14:50:00    0.00   5048.27
> 14:55:00    0.00   7597.28
> 15:00:00    0.00   7247.46
> 15:05:00    0.00   7148.11
> 15:10:00    0.00   5934.86
> 15:15:00    0.00   5720.28
> 15:20:00    0.00   5669.58
> 15:25:00    0.00   4660.03
> 15:30:00    0.00   5560.06
> 15:35:00    0.00   4718.81
> 15:40:00    0.00   2091.79
> 15:45:00    0.00     12.82
> 15:50:00    0.00     12.79
> 15:55:00    0.00     13.10
> 16:00:00    0.00     12.67
We can see here that the database engine normally does a lot of
semaphore activity, but once it gets into this bad state, it stops.
It's looping in core.
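
To put a number on how sharply the semaphore activity stops, one can compare the average sema/s rate before and after the hang. A minimal sketch, assuming the hang is taken to start at the 15:45 sample; the function name and that cut-off are choices made for this example only.

```python
from statistics import mean

# (time, sema/s) pairs transcribed from the (-m) excerpt
SEMA_PER_SEC = [
    ("14:40:00", 2563.73), ("14:45:00", 4192.57), ("14:50:00", 5048.27),
    ("14:55:00", 7597.28), ("15:00:00", 7247.46), ("15:05:00", 7148.11),
    ("15:10:00", 5934.86), ("15:15:00", 5720.28), ("15:20:00", 5669.58),
    ("15:25:00", 4660.03), ("15:30:00", 5560.06), ("15:35:00", 4718.81),
    ("15:40:00", 2091.79), ("15:45:00", 12.82), ("15:50:00", 12.79),
    ("15:55:00", 13.10), ("16:00:00", 12.67),
]


def drop_factor(samples, hang_start):
    """Ratio of the mean rate before hang_start to the mean rate from it on.

    Times are HH:MM:SS strings from a single day, so plain string
    comparison orders them correctly.
    """
    before = [rate for time, rate in samples if time < hang_start]
    after = [rate for time, rate in samples if time >= hang_start]
    return mean(before) / mean(after)


print(f"sema/s dropped by a factor of about {drop_factor(SEMA_PER_SEC, '15:45:00'):.0f}")
```

A drop of several hundred times, with %usr pinned near 99 over the same samples, fits a process that has stopped asking the kernel for anything.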
> 08:00:01  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s   (-c)
> 14:40:00    47760    14680      333    6.73    7.87   8205354    87337
> 14:45:00    46093    13535      322    6.06    7.05   8468079    99356
> 14:50:00    32606     8562      321    4.07    5.07   5434781    91777
> 14:55:00    37005     8743      519    3.63    4.57   5372965   174227
> 15:00:00    39274     9617      524    4.26    5.30   5978026   167135
> 15:05:00    37010     8952      485    3.54    4.65   5916037   152972
> 15:10:00    40199    10591      444    3.64    4.69   6629425   147242
> 15:15:00    34367     8692      413    3.19    4.12   5436009   121322
> 15:20:00    42047    11255      429    5.18    6.30   6784568   157546
> 15:25:00    42976    12003      390    4.14    4.75   7369400   148823
> 15:30:00    55452    15765      390    4.37    5.06   9125374   133925
> 15:35:00    45569    12885      335    3.23    3.73   7570264   100898
> 15:40:00    14034     3503      248    5.50    5.66   2033284    80457
> 15:45:00     3519      823      260    6.73    6.45    342912    82584
> 15:50:00     3104      572      219    6.05    5.78    246789    59167
> 15:55:00     4157      856      406    9.71    9.28    433506   187387
> 16:00:00     2938      577      240    6.42    6.12    315012    97352
Likewise, when it's distracted by this spin, it stops doing so many
system calls.
It definitely looks like a spin inside the database user process.
Both `truss` and `trace` can attach to a running process, to show you
the system calls it is doing. Try those. It's likely that they'll show
no calls being made.
`dbx` can also attach to a running process; if you tell it to step by
instruction, you can watch the process spin for a while. If you see it
looping through the same instructions repeatedly, that's a hint. You
won't be able to make too much sense of it without the source, but a
loop is a loop...
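
If you write down the instruction addresses while stepping, spotting the repetition can be automated. The sketch below is only an illustration: the addresses in TRACE are made up, and `find_loop` is a hypothetical helper, not part of `dbx`. It looks for the shortest sequence of addresses that repeats back-to-back at the end of the trace.

```python
def find_loop(trace, min_repeats=3):
    """Return the shortest address cycle that ends the trace, or None.

    A cycle of length p is accepted only if the last p * min_repeats
    addresses are that same cycle repeated min_repeats times.
    """
    n = len(trace)
    for period in range(1, n // min_repeats + 1):
        window = trace[n - period * min_repeats:]
        cycle = window[:period]
        if all(addr == cycle[i % period] for i, addr in enumerate(window)):
            return cycle
    return None


# Hypothetical program-counter values noted while single-stepping
TRACE = [
    0x8048A10, 0x8048A14,
    0x8048B00, 0x8048B04, 0x8048B08,
    0x8048B00, 0x8048B04, 0x8048B08,
    0x8048B00, 0x8048B04, 0x8048B08,
]

loop = find_loop(TRACE)
if loop is None:
    print("no repeating cycle found")
else:
    print("loop of", len(loop), "instructions:", [hex(a) for a in loop])
```

A real spin may have a much longer body, or take different branches on different passes, so a short trace that shows no cycle does not rule out a loop.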
>Bela<