Re: NIM thread blocked



Hello Holger,

I may not have made everything clear. The fact that a thread did not get
the CPU does not mean you are CPU constrained. "NIM thread blocked" is an
I/O problem.

The Network Interface Module is waiting to get the CPU to write a
heartbeat. It is not getting the CPU because the I/O resource is
constrained, so the scheduling algorithm will not assign it one of the
many CPUs that have nothing to do, because some higher-priority
thread is always getting the CPU to do its I/O.

Remember that HACMP and RSCT add CPU, memory and I/O load to your
system to monitor for problems, but your application has a higher
priority than HACMP and RSCT. You build your cluster for the application,
not because you like HACMP or you like clusters :-).

In your case, adding additional I/O paths could help prevent such a
situation. Adding another heartbeat network (no matter whether IP or
non-IP) could help. I can't really comment on your system and
application. But if you want to provoke "NIM thread blocked" situations,
running a LAN-based backup of your disks is a good way to get there,
especially if you are using either EtherChannel or virtual Ethernet for
the backup.

Like others, I would suggest opening a call with IBM to do a root cause
analysis of why the failover occurred (if you still have the logs). As
soon as something like this happens, at least save a snap of your
cluster for support. Saving a "perfpmr" once a month, to have
something to compare against, also makes a lot of sense.

My guess would still be I/O contention.

I hope this helps.
Regards, Stefan
--
Stefan Gocke
e-mail: Stefan.Gocke@xxxxxxxxxxx
- IBM Certified Systems Expert - HACMP, Virtualization ...
- IBM Certified Deployment Professional - IBM TSM V5.2+IBM TSM V5.3
- IBM Certified Advanced Technical Expert - System p 2006, AIX5L, AIX4

-----Original Message-----
Date: Tue, 24 Feb 2009 14:55:43 +0100
Subject: Re: NIM thread blocked
From: Holger van Koll <Holger.vanKoll@xxxxxxxxxxxx>
To: aix-l@xxxxxxxxxxxxx

   

the problem is that there are so many "sometimes" in this situation

sometimes only a disk heartbeat is blocked
sometimes only a network heartbeat is blocked
sometimes both
sometimes there is one entry in errlog... i ignore it
sometimes there are 3-4 errors

2 days ago 2 nodes of the same cluster started to log those errors
every node reported nim_threads blocked for 40 seconds

finally they didn't see each other anymore... standby took over but
primary didn't notice... when that situation went away a dms was
triggered

this cluster is a 64 cpu p595
how can 4 (nim-)threads be blocked for 40 seconds on a system having
64 (physical) cpus???

it is NOT an ntp problem. time is in sync and is synchronized 2 times
a day with ntpdate/cron

(you can easily trigger this error in errpt by giving a kill -17 to
the hats-proc, waiting 30 seconds and giving a kill -19 to it. so ntp
could be a problem... but isn't)
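
The stop/continue experiment above boils down to a gap in an otherwise regular heartbeat stream. As a rough illustration only (this is not how topsvcs itself detects the condition, and the function name, interval and tolerance are assumptions made up for the sketch), a gap check over a list of heartbeat timestamps could look like this:

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], interval: float,
              tolerance: float) -> List[Tuple[float, float]]:
    """Return (last_beat_time, gap_length) for every pair of consecutive
    heartbeats that are further apart than interval + tolerance."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, delta))
    return gaps


# Hypothetical data: one heartbeat per second, with the sender
# stopped for 30 seconds after the beat at t=3.
beats = [0.0, 1.0, 2.0, 3.0, 33.0, 34.0]
print(find_gaps(beats, interval=1.0, tolerance=0.5))  # [(3.0, 30.0)]
```

Any process that is stopped, or simply not scheduled, for longer than the tolerance shows up the same way, whatever the underlying cause.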

once when this error came I was logged in on the node and had a
vmstat running

root@sbpsgava01:/root > errpt|head
IDENTIFIER  TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D    1030182208 P S topsvcs        NIM thread blocked
3D32B80D    1030182208 P S topsvcs        NIM thread blocked
--> error at 18:22
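
The TIMESTAMP column of the short errpt listing is in MMDDhhmmyy form, which is how 1030182208 reads as 18:22 on 30 October. A minimal sketch for decoding it (the function name is made up for this example, and two-digit years follow Python's own %y rule) might be:

```python
from datetime import datetime


def parse_errpt_timestamp(stamp: str) -> datetime:
    """Decode an errpt short-listing timestamp (MMDDhhmmyy)."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"unexpected errpt timestamp: {stamp!r}")
    return datetime.strptime(stamp, "%m%d%H%M%y")


ts = parse_errpt_timestamp("1030182208")
print(ts.strftime("%Y-%m-%d %H:%M"))  # 2008-10-30 18:22
```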

now look at vmstat-output:

System configuration: lcpu=12 mem=28672MB ent=6.00

  kthr        memory                  page                  faults              cpu            time
 ------- ----------------- ---------------------------- --------------- --------------------- --------
 r  b  p      avm      fre   fi   fo  pi  po   fr    sr   in    sy   cs us sy id wa   pc   ec hr mi se
 1  1  0  4387395  2332220   53   35   0   0   68   821   69 10613  445  9  1 89  2 0.59  9.9 18:18:03
 2  4  0  4387482  2332188   16  139   0   0   96  1035  205  6263  659  4  1 94  1 0.33  5.6 18:19:03
 2  1  0  4389805  2330376   38   70   0   0   63   723  117  6636  513 10  1 87  1 0.71 11.9 18:20:03
 3  1  0  4392562  2327118   12   47   0   0   28   314   79  7237  450  3  1 95  0 0.27  4.5 18:21:03
 6  1  0  4388433  2331413   23   53   0   0   52   634   90  6105  499  3  1 95  1 0.29  4.8 18:22:03
 5  1  0  4377480  2342374   46   46   0   0   74  1055  102  3373  551  4  1 94  1 0.32  5.3 18:23:03
 2  1  0  4388646  2330929  132   56   0   0  156  2203  122 16122  596 12  1 82  4 0.84 13.9 18:24:03
 2  1  0  4391073  2328497   81   44   0   0  104  1632  139 22069  647 13  2 82  3 0.94 15.7 18:25:03
 1  1  0  4395142  2324464  108   30   0   0  119  1667  102 18325  564 13  1 82  4 0.87 14.4 18:26:03
 2  1  0  4362831  2356799   88   34   0   0  104  1655  103 10019  503  7  1 88  3 0.51  8.5 18:27:03

6 physical cpu. 12 logical. only 6 running and 1 blocked process at
18:22.

94 % idle!!
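
To pull the same numbers out of a saved vmstat log instead of reading them by eye, a small parser is enough. The sketch below assumes the 21-column layout shown in this post (kthr, memory, page, faults, cpu, then a timestamp); the column names `syscalls` and `cpu_sy` are invented here only to tell the two "sy" columns apart, and the sample rows are copied from the output above.

```python
COLUMNS = ["r", "b", "p", "avm", "fre", "fi", "fo", "pi", "po", "fr", "sr",
           "in", "syscalls", "cs", "us", "cpu_sy", "id", "wa", "pc", "ec",
           "time"]

SAMPLE = """\
3 1 0 4392562 2327118 12 47 0 0 28 314 79 7237 450 3 1 95 0 0.27 4.5 18:21:03
6 1 0 4388433 2331413 23 53 0 0 52 634 90 6105 499 3 1 95 1 0.29 4.8 18:22:03
5 1 0 4377480 2342374 46 46 0 0 74 1055 102 3373 551 4 1 94 1 0.32 5.3 18:23:03
"""


def parse_row(line: str) -> dict:
    """Turn one vmstat data line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {}
    for name, value in zip(COLUMNS, fields):
        if name == "time":
            row[name] = value           # keep hh:mm:ss as text
        elif name in ("pc", "ec"):
            row[name] = float(value)    # physical CPUs consumed, % entitlement
        else:
            row[name] = int(value)
    return row


rows = [parse_row(line) for line in SAMPLE.splitlines()]
at_error = next(r for r in rows if r["time"].startswith("18:22"))
print(at_error["r"], at_error["b"], at_error["id"], at_error["wa"])  # 6 1 95 1
```

With one-minute samples, a stall of a few tens of seconds can be averaged away, so a row like this only shows that the interval as a whole was mostly idle.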

so the often-heard response from ibm to this problem (which goes
"there was too much load on the system") cannot convince me

 

-----Original Message-----
From: IBM AIX Discussion List [mailto:aix-l@xxxxxxxxxxxxx [1]] On
Behalf Of Stefan.Gocke@xxxxxxxxxxx
Sent: Tuesday, February 24, 2009 2:38 PM
To: aix-l@xxxxxxxxxxxxx
Subject: Re: NIM thread blocked

Hello Holger,

this most often comes from the disk heartbeat when a backup runs.

Is the NIM thread blocked on the disk heartbeat or the LAN
heartbeat?
Does it occur on all interfaces? Then it's time to really do
something.

There is also an old bug in some releases/PTFs of HACMP that caused
this when the automatic cluster verification runs. I've seen clusters
where 400 hdisks had this error. It happened when the automatic
verification ran from the node NOT having the disks. That was a
programming error, not a real error. If you ran manual verification,
it didn't happen.

When this occurs, the system did not give the CPU to the heartbeat
process to write its heartbeat, because a higher-priority thread was
blocking access to that device. If it happens often, I normally
suggest adding another Fibre Channel adapter (for disk heartbeat).

As long as all other heartbeats are working normally and it's just
sporadic, ignore it for now and monitor that it doesn't happen too
often.
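
One way to keep an eye on how often it happens is to count the topsvcs "NIM thread blocked" entries per day from saved errpt output. The sketch below is only an illustration: it assumes the short errpt listing format quoted in this thread (identifier, MMDDhhmmyy timestamp, type, class, resource name, description), and the function name is made up.

```python
from collections import Counter
from datetime import datetime

SAMPLE = """\
IDENTIFIER  TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D    1030182208 P S topsvcs        NIM thread blocked
3D32B80D    1030182208 P S topsvcs        NIM thread blocked
"""


def count_blocked_per_day(errpt_output: str) -> Counter:
    """Count 'NIM thread blocked' entries logged by topsvcs, per date."""
    counts = Counter()
    for line in errpt_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a regular entry line.
        if len(fields) < 6 or len(fields[1]) != 10 or not fields[1].isdigit():
            continue
        if fields[4] != "topsvcs" or "NIM thread blocked" not in line:
            continue
        day = datetime.strptime(fields[1], "%m%d%H%M%y").date()
        counts[day.isoformat()] += 1
    return counts


print(dict(count_blocked_per_day(SAMPLE)))  # {'2008-10-30': 2}
```

What counts as "too often" is a judgment call for each cluster; the counts only make the trend visible.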

Regards, Stefan

--
Stefan Gocke
e-mail: Stefan.Gocke@xxxxxxxxxxx
- IBM Certified Systems Expert - HACMP, Virtualization ...
- IBM Certified Deployment Professional - IBM TSM V5.2+IBM TSM V5.3
- IBM Certified Advanced Technical Expert - System p 2006, AIX5L, AIX4

-----Original Message-----
Date: Tue, 24 Feb 2009 13:35:24 +0100
Subject: NIM thread blocked
From: Holger van Koll
To: aix-l@xxxxxxxxxxxxx

Hello,

on about 60 systems (probably all that have hacmp running) I get entries
in errpt like these:

3D32B80D   16-02-09 00:05  P S topsvcs         NIM thread blocked

Details in errlog tell that those nim-threads (one per heartbeat) have
been blocked for a certain amount of time, can be 5 seconds, can be 50.
When I look at performance-logging tools (like patrol or even simple
vmstat commands that were running) I see that those commands have been
blocked for approximately the same amount of time.

So, something on some of my nodes prevents tasks from being executed.
The nodes vary from 64 cpu p595 to partitions with 0.5 cpu.

The errors come without any regularity. One night 5 come. Then it's
quiet for days or weeks.

Does anybody have an idea or at least a similar situation?

Regs, Holger


Links:
------
[1] mailto:aix-l@xxxxxxxxxxxxx


