Re: NIM thread blocked
- From: "Stefan.Gocke@xxxxxxxxxxx" <Stefan.Gocke@xxxxxxxxxxx>
- Date: Wed, 25 Feb 2009 10:29:32 +0100
Hello Holger,
I may not have made everything clear. The fact that a thread did not get
the CPU does not mean you are CPU constrained. "NIM thread blocked" is an
I/O problem.
The Network Interface Module is waiting to get the CPU to write a
heartbeat. It is not getting the CPU because the I/O resource is
constrained, so the scheduling algorithm will not assign one of the
many CPUs that have nothing to do, because some higher-priority
thread is always getting the CPU to do its I/O.
Remember that HACMP and RSCT add CPU, memory and I/O load to your
system in order to monitor for problems, but your application has a
higher priority than HACMP and RSCT. You build your cluster for the
application, not because you like HACMP or you like clusters :-).
In your case, adding additional I/O paths could help prevent such a
situation. Adding another heartbeat network (no matter whether IP or
non-IP) could help. I can't really comment on your system and
application. But a LAN-based backup of your disks is a reliable way to
get into "NIM thread blocked" situations, especially if the backup runs
over either EtherChannel or virtual Ethernet.
Like others, I would suggest opening a call with IBM for a root cause
analysis of why the failover occurred (if you still have the logs). As
soon as something like this happens, at least save a snap of your
cluster for support. Saving a "perfpmr" once a month, to have
something to compare against, also makes a lot of sense.
My guess would still be I/O contention.
I hope this helps.
Regards, Stefan
--
Stefan Gocke
e-mail: Stefan.Gocke@xxxxxxxxxxx
- IBM Certified Systems Expert - HACMP, Virtualization ...
- IBM Certified Deployment Professional - IBM TSM V5.2+IBM TSM V5.3
- IBM Certified Advanced Technical Expert - System p 2006, AIX5L, AIX4
-----Original Message-----
Date: Tue, 24 Feb 2009 14:55:43 +0100
Subject: Re: NIM thread blocked
From: Holger van Koll <Holger.vanKoll@xxxxxxxxxxxx>
To: aix-l@xxxxxxxxxxxxx
the problem is that there are so many "sometimes" in this situation
sometimes only a disk-heartbeat is blocked
sometimes only a network-heartbeat is blocked
sometimes both
sometimes there is one entry in errlog... i ignore it
sometimes there are 3-4 errors
2 days ago 2 nodes of the same cluster started to log those errors
every node reported nim_threads blocked for 40 seconds
finally they didn't see each other anymore... standby took over but
primary didn't notice... when that situation went away a DMS (dead man
switch) was triggered
this cluster is a 64 cpu p595
how can 4 (nim-)threads be blocked for 40 seconds on a system having
64 (physical) cpus???
it is NOT an ntp problem. time is in sync and is synchronized 2 times
a day with ntpdate/cron
(you can easily trigger this error in errpt by giving a kill -17
(SIGSTOP on AIX) to the hats-proc, waiting 30 seconds and giving it a
kill -19 (SIGCONT). so ntp could be a problem... but isn't)
once when this error came I was logged in on the node and had a
vmstat running
root@sbpsgava01:/root > errpt|head
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D   1030182208 P S topsvcs        NIM thread blocked
3D32B80D   1030182208 P S topsvcs        NIM thread blocked
--> error at 18:22
now look at vmstat-output:
System configuration: lcpu=12 mem=28672MB ent=6.00

kthr  memory          page                   faults        cpu                   time
----- --------------- ---------------------- ------------- --------------------- --------
r b p     avm     fre  fi  fo pi po  fr   sr  in    sy  cs us sy id wa   pc   ec hr mi se
1 1 0 4387395 2332220  53  35  0  0  68  821  69 10613 445  9  1 89  2 0.59  9.9 18:18:03
2 4 0 4387482 2332188  16 139  0  0  96 1035 205  6263 659  4  1 94  1 0.33  5.6 18:19:03
2 1 0 4389805 2330376  38  70  0  0  63  723 117  6636 513 10  1 87  1 0.71 11.9 18:20:03
3 1 0 4392562 2327118  12  47  0  0  28  314  79  7237 450  3  1 95  0 0.27  4.5 18:21:03
6 1 0 4388433 2331413  23  53  0  0  52  634  90  6105 499  3  1 95  1 0.29  4.8 18:22:03
5 1 0 4377480 2342374  46  46  0  0  74 1055 102  3373 551  4  1 94  1 0.32  5.3 18:23:03
2 1 0 4388646 2330929 132  56  0  0 156 2203 122 16122 596 12  1 82  4 0.84 13.9 18:24:03
2 1 0 4391073 2328497  81  44  0  0 104 1632 139 22069 647 13  2 82  3 0.94 15.7 18:25:03
1 1 0 4395142 2324464 108  30  0  0 119 1667 102 18325 564 13  1 82  4 0.87 14.4 18:26:03
2 1 0 4362831 2356799  88  34  0  0 104 1655 103 10019 503  7  1 88  3 0.51  8.5 18:27:03
6 physical cpu. 12 logical. only 6 running and 1 blocked process at
18:22.
94 % idle!!
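For anyone who wants to re-check that reading of the vmstat output, here is a minimal Python sketch that parses the ten samples and prints the two that bracket the 18:22 errpt entry. The column order is taken from the vmstat header quoted in this mail; the names `COLUMNS`, `SAMPLES` and `parse_vmstat` are made up for illustration and are not part of any AIX tool.

```python
# Illustrative only: these names are hypothetical, not from AIX or HACMP.
# Column order follows the vmstat header quoted in this mail. "sy" appears
# twice there (system calls, then system CPU %), so the first is renamed.
COLUMNS = ["r", "b", "p", "avm", "fre", "fi", "fo", "pi", "po", "fr", "sr",
           "in", "sy_calls", "cs", "us", "sy", "id", "wa", "pc", "ec", "time"]

SAMPLES = """\
1 1 0 4387395 2332220 53 35 0 0 68 821 69 10613 445 9 1 89 2 0.59 9.9 18:18:03
2 4 0 4387482 2332188 16 139 0 0 96 1035 205 6263 659 4 1 94 1 0.33 5.6 18:19:03
2 1 0 4389805 2330376 38 70 0 0 63 723 117 6636 513 10 1 87 1 0.71 11.9 18:20:03
3 1 0 4392562 2327118 12 47 0 0 28 314 79 7237 450 3 1 95 0 0.27 4.5 18:21:03
6 1 0 4388433 2331413 23 53 0 0 52 634 90 6105 499 3 1 95 1 0.29 4.8 18:22:03
5 1 0 4377480 2342374 46 46 0 0 74 1055 102 3373 551 4 1 94 1 0.32 5.3 18:23:03
2 1 0 4388646 2330929 132 56 0 0 156 2203 122 16122 596 12 1 82 4 0.84 13.9 18:24:03
2 1 0 4391073 2328497 81 44 0 0 104 1632 139 22069 647 13 2 82 3 0.94 15.7 18:25:03
1 1 0 4395142 2324464 108 30 0 0 119 1667 102 18325 564 13 1 82 4 0.87 14.4 18:26:03
2 1 0 4362831 2356799 88 34 0 0 104 1655 103 10019 503 7 1 88 3 0.51 8.5 18:27:03
"""


def parse_vmstat(text):
    """Turn whitespace-separated vmstat rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip headers, blank lines and anything malformed
        row = dict(zip(COLUMNS, fields))
        for key in COLUMNS[:-1]:  # every column except the timestamp is numeric
            row[key] = float(row[key])
        rows.append(row)
    return rows


rows = parse_vmstat(SAMPLES)
by_time = {row["time"]: row for row in rows}

# The errpt entry is stamped 18:22, so show the samples on either side of it.
for stamp in ("18:22:03", "18:23:03"):
    row = by_time[stamp]
    print("%s r=%d b=%d id=%d%% wa=%d%% ec=%.1f%%"
          % (stamp, row["r"], row["b"], row["id"], row["wa"], row["ec"]))

print("lowest idle %% in the window: %d" % min(row["id"] for row in rows))
```

The samples are 60 seconds apart, so each row is an average over the preceding minute.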
so the often heard response from ibm to this problem (which goes
"there was too much load on the system") cannot convince me
-----Original Message-----
From: IBM AIX Discussion List [mailto:aix-l@xxxxxxxxxxxxx] On
Behalf Of Stefan.Gocke@xxxxxxxxxxx
Sent: Tuesday, February 24, 2009 2:38 PM
To: aix-l@xxxxxxxxxxxxx
Subject: Re: NIM thread blocked
Hello Holger,
this most often comes from the disk heartbeat when a backup runs.
Is the NIM thread blocked on the disk heartbeat or the LAN
heartbeat?
Does it occur on all interfaces? Then it's time to really do
something.
And there is an old bug in some releases/PTFs of HACMP that caused
problems when the automatic cluster verification runs. I've seen
clusters where 400 hdisks had this error. It happened when the automatic
verification ran from the node NOT having the disks. That was a
programming error, not a real error. If you ran manual verification
it didn't happen.
When this occurs, the system did not give the CPU to the heartbeat
process to write its heartbeat, because a higher-priority thread was
blocking access to that device. If it happens often, I normally
suggest adding another Fibre Channel adapter (if it is the disk
heartbeat).
As long as all other heartbeats are working normally and it's just
sporadic, ignore it for now and monitor that it doesn't happen too
often.
Regards. Stefan
-----Original Message-----
Date: Tue, 24 Feb 2009 13:35:24 +0100
Subject: NIM thread blocked
From: Holger van Koll
To: aix-l@xxxxxxxxxxxxx
Hello,
on about 60 systems (probably all that have HACMP running) I get
entries in errpt like these:
3D32B80D 16-02-09 00:05 P S topsvcs NIM thread blocked
Details in errlog tell that those nim-threads (one per heartbeat) have
been blocked for a certain amount of time, can be 5 seconds, can be 50.
When I look at performance-logging tools (like patrol or even simple
vmstat commands that were running) I see that those commands have
been blocked for approximately the same amount of time. So
something on some of my nodes prevents tasks from being executed. The
nodes vary from a 64 cpu p595 to partitions with 0.5 cpu. The errors
come without any regularity. One night 5 come. Then it's quiet for days
or weeks.
Does anybody have an idea or at least a similar situation?
Regs, Holger
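One way to check whether a periodic logger (such as a vmstat loop with timestamps) was held up at the same moment as the NIM threads is to look for gaps in its sample times. The following is a minimal Python sketch of that idea, not anything shipped with HACMP or RSCT; `find_stalls`, its parameters and the example timestamps are all hypothetical, and it assumes all samples fall on the same day.

```python
from datetime import datetime


def find_stalls(stamps, interval_s, tolerance_s=1.0):
    """Return (previous, current, delay_s) for every sample that arrived
    more than tolerance_s seconds later than the sampling interval allows.

    stamps     -- "HH:MM:SS" strings in chronological order, same day
    interval_s -- the interval the sampler was asked to use, in seconds
    """
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    stalls = []
    for prev, cur, prev_s, cur_s in zip(times, times[1:], stamps, stamps[1:]):
        delay = (cur - prev).total_seconds() - interval_s
        if delay > tolerance_s:
            stalls.append((prev_s, cur_s, delay))
    return stalls


# Made-up timestamps from a sampler asked to report every 5 seconds.
example = ["00:05:00", "00:05:05", "00:05:10", "00:05:55", "00:06:00"]
for prev_s, cur_s, delay in find_stalls(example, interval_s=5):
    print(f"sample after {prev_s} arrived at {cur_s}, {delay:.0f}s late")
```

If the delays found this way line up with the blocked times reported in the errlog details, that supports the observation that the whole node, not just topsvcs, was stalled.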