IO WAIT Information From IBM

From: Jason delaFuente (jason.delafuente_at_GBE.COM)
Date: 04/08/04

    Date:         Thu, 8 Apr 2004 09:52:26 -0500
    To: aix-l@Princeton.EDU
    
    

    This is from a PDF and text file written by one of the IBM Performance Specialists. I don't think I can send attachments to the list so I have pasted everything here. It contains some tables so I have tried to format as best as possible in this email:

    A great deal of controversy exists around the interpretation of
    the I/O wait metric in AIX. This number shows up in the rightmost "wa"
    column of vmstat output, the "% iowait" column in iostat, the %wio column
    in sar -P output, and the ASCII bar graph titled "wait" in topas. Confusion exists
    when I/O wait is evaluated for performance or capacity planning as to
    whether this number should be considered CPU cycles that are used or
    cycles that should be added to the system idle time indicating unused
    capacity. This paper will explain how this metric is captured and calculated
    as well as provide a "case study" example to illustrate the effects.
    A review of some of the basic AIX functions will assist in a better
    understanding of how the I/O wait value is collected and calculated. The
    AIX scheduler, the CPU "queues", the CPU states, and the idle or wait
    process, will be discussed.
    The scheduler is a part of the AIX kernel that is tasked with making sure the
    individual CPUs have work to do and in the case where there are more
    runnable jobs (threads) than CPUs, to make sure each one gets its fair share
    of the CPU resource. The system contains a hardware timer which
    generates 100 interrupts/second. This interrupt will then dispatch the kernel
    scheduler process which runs at a fixed priority of 16. The scheduler will
    first charge the running thread with the 10 millisecond time slice and then
    dispatch another thread (context switch) of equal or higher priority on that
    CPU assuming there are other runnable threads. This short term CPU usage
    is reported in the "C" column when including a -l option with the ps
    command.
    Demystifying I/O Wait
    Harold Lee - ATS 12/11/2002 Page 1

    Partial ps command output showing short term CPU usage in "C" column
    # ps -aekl
         F S UID  PID PPID   C PRI NI  ADDR  SZ    WCHAN TTY     TIME CMD
       303 A   0    0    0 120  16 -- 12012  12            -     4:17 swapper
    200003 A   0    1    0   0  60 20  c00c 732            -     0:22 init
       303 A   0  516    0 120 127 -- 13013   8            - 31972:23 kproc
       303 A   0  774    0 120 127 -- 14014   8            - 31322:34 kproc
       303 A   0 1032    0   0  16 -- 17017  12            -     0:00 kproc
       303 A   0 1290    0   0  36 -- 1e01e  16            -     0:32 kproc
       303 A   0 1548    0   0  37 -- 1f01f  64        *   -     5:09 kproc
       303 A   0 1806    0   0  60 --  c02c  16 3127c558   -     0:04 kproc
    240001 A   0 2638    1   0  60 20 12212 108 3127ab98   -   102:04 syncd

    One hundred times a second, the scheduler will take the process that is
    currently running on each CPU, and increment the "C" value by one. It will
    then recalculate that process's priority and rescan the process table looking
    for the next process to dispatch. If there are no runnable processes the
    scheduler will dispatch the "idle" kernel process. There is one of these
    assigned to each CPU, and each is bound to that particular processor. The
    following output shows a four way system with four wait processes each
    bound to a CPU.
    THREAD TABLE :
    SLT ST TID PID CPUID   POLICY PRI CPU EVENT PROCNAME
      0 s    3   0 bound   FIFO    10  78       swapper
            flags: kthread
      1 s  103   1 unbound other   3c   0       init
            flags: local wakeonsig cdefer
            unknown: 0x10000
      2 r  205 204 0       FIFO    ff  78       wait
            flags: funnel kthread
      3 r  307 306 1       FIFO    ff  78       wait
            flags: funnel kthread
      4 r  409 408 2       FIFO    ff  78       wait
            flags: funnel kthread
      5 r  50b 50a 3       FIFO    ff  78       wait
            flags: funnel kthread
      6 s  60d 60c unbound RR      11  2b       reaper
    Also notice that the wait process priority is 0xff. The MSB has been turned
    off to give a priority range of 0-127 on AIX 5.1 and lower. In AIX 5.2 and
    higher the range of priorities has been increased to 255 to allow more
    granularity for control when using Workload Manager (WLM).
    If there are no processes to dispatch, the scheduler will dispatch the "wait"
    process, which will run until any other process becomes runnable. That
    process will then be dispatched immediately, since it will always have a
    higher priority than the wait process. The wait process's only job is to
    increment the counters that report whether that particular processor is
    "idle" or "waiting for I/O". It is important to remember that the "waiting
    for I/O" metric is incremented by the idle process. Whether the idle process
    increments the "idle" counter or the "waiting for I/O" counter depends on
    whether there is a process sitting in the blocked queue. Processes which
    would otherwise be runnable but are waiting on data from a disk are placed
    on the blocked queue to wait for their data. If no processes are sitting on
    that particular processor's blocked queue, then the wait process will charge
    the time to "idle". If there are one or more processes on that particular
    processor's blocked queue, then the system
    charges the time to "waiting for I/O". Waiting for I/O is considered to be a
    special case of idle, and therefore the percentage of time spent waiting for
    I/O is usable by processes to perform work.
    A case study will be presented to illustrate this concept. Consider a single
    CPU system that has two tasks to perform. Task A is a CPU intensive
    program and Task B is an I/O intensive program. The effects of these
    programs on the vmstat output will be considered separately and then
    combined.
    When Task "A", which is CPU intensive, is run on a single CPU system, the
    majority of the CPU time will be spent in the "user" (us) mode. The vmstat
    output below reflects the effects of a single process running.
    $ vmstat 1
    kthr memory page faults cpu
    ----- ----------- ------------------------ ------------ -----------
    r b avm fre re pi po fr sr cy in sy cs us sy id wa
    1 0 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
    1 0 106072 164600 0 0 0 0 0 0 239 2543 413 99 1 0 0
    1 0 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
    1 0 106072 164600 0 0 0 0 0 0 235 2426 405 98 2 0 0
    1 0 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
    1 0 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
    When Task "B", which is I/O intensive, is run on a single CPU system, the
    majority of the CPU time will be spent in the "waiting for I/O" (wa) mode.
    The vmstat output below reflects the effects of a single process running.
    $ vmstat 1
    kthr memory page faults cpu
    ----- ----------- ------------------------ ------------ -----------
    r b avm fre re pi po fr sr cy in sy cs us sy id wa
    0 1 106067 164605 0 0 0 0 23 0 232 835 411 0 1 0 99
    0 1 106072 164600 0 0 0 0 0 0 239 2543 413 0 1 0 99
    0 1 106072 164600 0 0 0 0 0 0 234 2425 403 0 1 0 99
    1 1 106072 164600 0 0 0 0 0 0 235 2426 405 0 2 0 98
    0 1 106072 164600 0 0 0 0 0 0 241 2572 428 0 1 0 99
    0 1 106072 164600 0 0 0 0 0 0 233 2490 475 0 1 0 99
    If Task "A" is started on a single CPU system while Task "B", which is I/O
    intensive, is running, the majority of the CPU time will be spent in the "user"
    (us) mode. This shows that all of the CPU cycles spent in the "waiting for
    I/O" mode have been recovered and are usable by other processes. The
    vmstat output below reflects the effects of running a CPU intensive program
    and an I/O intensive program simultaneously.
    $ vmstat 1
    kthr memory page faults cpu
    ----- ----------- ------------------------ ------------ -----------
    r b avm fre re pi po fr sr cy in sy cs us sy id wa
    1 1 106067 164605 0 0 0 0 23 0 232 835 411 99 0 0 0
    1 1 106072 164600 0 0 0 0 0 0 239 2543 413 99 2 0 0
    1 1 106072 164600 0 0 0 0 0 0 234 2425 403 99 0 0 0
    2 1 106072 164600 0 0 0 0 0 0 235 2426 405 98 1 0 0
    1 1 106072 164600 0 0 0 0 0 0 241 2572 428 99 1 0 0
    1 1 106072 164600 0 0 0 0 0 0 233 2490 475 99 0 0 0
    One item to note from this example: I/O bound systems cannot always be
    identified by looking at the "waiting for I/O" metric alone. A busy system
    can mask the effects of I/O bottlenecks. To determine if an I/O bottleneck
    exists, the blocked queue as well as the output from iostat must also be
    considered.

    What exactly is "iowait"?

    To summarize it in one sentence, 'iowait' is the percentage
    of time the CPU is idle AND there is at least one I/O
    in progress.

    Each CPU can be in one of four states: user, sys, idle, iowait.
    Performance tools such as vmstat, iostat, sar, etc. print
    out these four states as a percentage. The sar tool can
    print out the states on a per CPU basis (-P flag) but most
    other tools print out the average values across all the CPUs.
    Since these are percentage values, the four state values
    should add up to 100%.

    The tools print out the statistics using counters that the
    kernel updates periodically. On AIX, these CPU state counters
    are incremented at every clock interrupt, which occurs
    at 10 millisecond intervals.
    When the clock interrupt occurs on a CPU, the kernel
    checks the CPU to see if it is idle or not. If it's not
    idle, the kernel then determines if the instruction being
    executed at that point is in user space or in kernel space.
    If user, then it increments the 'user' counter by one. If
    the instruction is in kernel space, then the 'sys' counter
    is incremented by one.

    If the CPU is idle, the kernel then determines if there is
    at least one I/O currently in progress to either a local disk
    or a remotely mounted disk (NFS) which had been initiated
    from that CPU. If there is, then the 'iowait' counter is
    incremented by one. If there is no I/O in progress that was
    initiated from that CPU, the 'idle' counter is incremented
    by one.
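
    The per-tick decision described above can be sketched in a few lines
    of Python. This is only an illustrative model of the rule as stated in
    the text, not kernel code: the names CpuState and classify_tick and the
    argument names are invented here, and the count of pending I/Os is
    assumed to be known for the CPU taking the interrupt.

```python
from enum import Enum

class CpuState(Enum):
    USER = "user"
    SYS = "sys"
    IDLE = "idle"
    IOWAIT = "iowait"

def classify_tick(cpu_is_idle, in_kernel_space, pending_io_from_this_cpu):
    """Return the counter one clock tick is charged to on one CPU.

    Hypothetical model of the rule in the text; the argument names are
    assumptions, not kernel variables.
    """
    if not cpu_is_idle:
        # A thread is running: charge the tick by where it is executing.
        return CpuState.SYS if in_kernel_space else CpuState.USER
    # Idle CPU: iowait only if this CPU started an I/O that is still in flight.
    return CpuState.IOWAIT if pending_io_from_this_cpu > 0 else CpuState.IDLE

# A busy CPU is charged to user even with three I/Os outstanding.
print(classify_tick(False, False, 3))
# An idle CPU with one outstanding I/O is charged to iowait.
print(classify_tick(True, False, 1))
# An idle CPU with nothing outstanding is charged to idle.
print(classify_tick(True, False, 0))
```

    Note that the pending I/O count is only consulted when the CPU is
    idle, so a CPU kept busy by other work never accumulates iowait.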

    When a performance tool such as vmstat is invoked, it reads
    the current values of these four counters. Then it sleeps
    for the number of seconds the user specified as the interval
    time and then reads the counters again. Then vmstat will
    subtract the previous values from the current values to
    get the delta value for this sampling period. Since vmstat
    knows that the counters are incremented at each clock
    tick (10ms), it then divides the delta value of
    each counter by the number of clock ticks in the sampling
    period. For example, if you run 'vmstat 2', this makes
    vmstat sample the counters every 2 seconds. Since the
    clock ticks at 10ms intervals, then there are 100 ticks
    per second or 200 ticks per vmstat interval (if the interval
    value is 2 seconds). The delta values of each counter
    are divided by the total ticks in the interval and
    multiplied by 100 to get the percentage value in that
    interval.
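
    As a rough sketch of that calculation, assuming a single CPU and two
    counter snapshots held in Python dictionaries (the function name
    cpu_percentages and the sample counter values are made up for
    illustration and do not come from a real system):

```python
TICKS_PER_SECOND = 100  # one clock tick every 10 ms, as stated in the text

def cpu_percentages(previous, current, interval_seconds):
    """Convert two snapshots of the tick counters into percentages.

    `previous` and `current` are dicts keyed by state name. This models a
    single CPU; the snapshot values used below are hypothetical.
    """
    total_ticks = TICKS_PER_SECOND * interval_seconds
    return {
        state: 100.0 * (current[state] - previous[state]) / total_ticks
        for state in ("user", "sys", "idle", "iowait")
    }

# Two hypothetical snapshots taken 2 seconds apart (200 ticks in total).
before = {"user": 5000, "sys": 1200, "idle": 9000, "iowait": 300}
after = {"user": 5120, "sys": 1210, "idle": 9030, "iowait": 340}
print(cpu_percentages(before, after, interval_seconds=2))
```

    With these made-up snapshots the deltas are 120, 10, 30 and 40 ticks
    out of 200, which works out to 60% user, 5% sys, 15% idle and 20%
    iowait.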

    iowait can in some cases be an indicator of a limiting factor
    to transaction throughput whereas in other cases, iowait may
    be completely meaningless.
    Some examples here will help to explain this. The first
    example is one where high iowait is a direct cause
    of a performance issue.

    Example 1:
    Let's say that a program needs to perform transactions on behalf of
    a batch job. For each transaction, the program will perform some
    computations which takes 10 milliseconds and then does a synchronous
    write of the results to disk. Since the file it is writing to was
    opened synchronously, the write does not return until the I/O has
    made it all the way to the disk. Let's say the disk subsystem does
    not have a cache and that each physical write I/O takes 20ms.
    This means that the program completes a transaction every 30ms.
    Over a period of 1 second (1000ms), the program can do 33
    transactions (33 tps). If this program is the only one running
    on a 1-CPU system, then the CPU would be busy 1/3 of the
    time and waiting on I/O the rest of the time - so about 67% iowait
    and 33% CPU busy.

    Now suppose the I/O subsystem is improved (let's say a disk cache is
    added) such that a write I/O takes only 1ms. This means that
    it takes 11ms to complete a transaction, and the program can
    now do around 90-91 transactions a second. Here the iowait time
    would be around 9%. Notice that a lower iowait time directly
    affects the throughput of the program.
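
    The arithmetic in Example 1 can be worked through with a small helper.
    This is a sketch under the example's own assumptions (one program, one
    CPU, strictly serial compute followed by a synchronous write); the
    function name transaction_profile is invented for illustration.

```python
def transaction_profile(compute_ms, io_ms):
    """Throughput and CPU split for a serial compute-then-sync-write loop."""
    cycle_ms = compute_ms + io_ms                # time for one transaction
    tps = 1000.0 / cycle_ms                      # transactions per second
    busy_pct = 100.0 * compute_ms / cycle_ms     # share of time computing
    iowait_pct = 100.0 * io_ms / cycle_ms        # share of time idle on I/O
    return tps, busy_pct, iowait_pct

for label, io_ms in (("no cache", 20), ("with cache", 1)):
    tps, busy, iowait = transaction_profile(compute_ms=10, io_ms=io_ms)
    print(f"{label}: {tps:.1f} tps, {busy:.1f}% busy, {iowait:.1f}% iowait")
```

    Because the program does nothing else while the write is outstanding,
    every millisecond removed from the I/O time shortens the transaction
    cycle, which is why iowait and throughput move together in this case.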

    Example 2:

    Let's say that there is one program running on the system - let's assume
    that this is the 'dd' program, and it is reading from the disk 4KB at
    a time. Let's say that the subroutine in 'dd' is called main() and it
    invokes read() to do a read. Both main() and read() are user space
    subroutines. read() is a libc.a subroutine which will then invoke
    the kread() system call at which point it enters kernel space.
    kread() will then initiate a physical I/O to the device and the 'dd'
    program is then put to sleep until the physical I/O completes.
    The time to execute the code in main, read, and kread is very small -
    probably around 50 microseconds at most. The time it takes for
    the disk to complete the I/O request will probably be around 2-20
    milliseconds depending on how far the disk arm had to seek. This
    means that when the clock interrupt occurs, the chances are that
    the 'dd' program is asleep and that the I/O is in progress. Therefore,
    the 'iowait' counter is incremented. If the I/O completes in
    2 milliseconds, then the 'dd' program runs again to do another read.
    But since 50 microseconds is so small compared to 2ms (2000 microseconds),
    the chances are that when the clock interrupt occurs, the CPU will
    again be idle with an I/O in progress. So again, 'iowait' is
    incremented. If 'sar -P <cpunumber>' is run to show the CPU
    utilization for this CPU, it will most likely show 97-98% iowait.
    If each I/O takes 20ms, then the iowait would be 99-100%.
    Even though the I/O wait is extremely high in either case,
    the throughput is 10 times better in one case.
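
    The sampling effect in Example 2 can be imitated by stepping a 10ms
    clock tick across a repeating cycle of a short CPU burst followed by
    an I/O sleep. This is a toy model, assuming the 50 microsecond burst
    from the text and a perfectly regular cycle with no jitter; the
    function name sampled_iowait_pct is made up for illustration.

```python
def sampled_iowait_pct(cpu_us, io_us, tick_us=10_000, ticks=10_000):
    """Sample a repeating run/sleep cycle at fixed clock ticks.

    Each cycle is `cpu_us` of running followed by `io_us` asleep on I/O.
    Returns the percentage of ticks that land in the sleeping part.
    """
    cycle_us = cpu_us + io_us
    iowait_ticks = 0
    for n in range(1, ticks + 1):
        position = (n * tick_us) % cycle_us   # where in the cycle the tick lands
        if position >= cpu_us:                # past the CPU burst: asleep on I/O
            iowait_ticks += 1
    return 100.0 * iowait_ticks / ticks

fast_disk = sampled_iowait_pct(cpu_us=50, io_us=2_000)
slow_disk = sampled_iowait_pct(cpu_us=50, io_us=20_000)
print(f"2 ms I/O:  {fast_disk:.2f}% iowait")
print(f"20 ms I/O: {slow_disk:.2f}% iowait")
```

    Both figures come out close to 100%, even though the cycle with the
    2ms I/O completes roughly ten reads in the time the 20ms cycle
    completes one.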

    Example 3:

    Let's say that there are two programs running on a CPU. One is a 'dd'
    program reading from the disk. The other is a program that does no
    I/O but is spending 100% of its time doing computational work.
    Now assume that there is a problem with the I/O subsystem and that
    physical I/Os are taking over a second to complete. Whenever the
    'dd' program is asleep while waiting for its I/Os to complete,
    the other program is able to run on that CPU. When the clock
    interrupt occurs, there will always be a program running in
    either user mode or system mode. Therefore, the %idle and %iowait
    values will be 0. Even though iowait is 0 now, that does not
    mean there is no I/O problem, because there obviously is one
    if physical I/Os are taking over a second to complete.

    Example 4:

    Let's say that there is a 4-CPU system where there are 6 programs
    running. Let's assume that four of the programs spend 70% of their
    time waiting on physical read I/Os and the other 30% actually using CPU time.
    Since these four programs do have to enter kernel space to execute the
    kread system calls, they will spend a percentage of their time in
    the kernel; let's assume that 25% of the time is in user mode,
    and 5% of the time in kernel mode.
    Let's also assume that the other two programs spend 100% of their
    time in user code doing computations and no I/O so that two CPUs
    will always be 100% busy. Since the other four programs are busy
    only 30% of the time, they can share the two CPUs that are not busy.

    If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
    for 10 intervals, then we'd expect to see this for each interval:

             cpu %usr %sys %wio %idle
              0 50 10 40 0
              1 50 10 40 0
              2 100 0 0 0
              3 100 0 0 0
              - 75 5 20 0

    Notice that the average CPU utilization will be 75% user, 5% sys,
    and 20% iowait. The values one sees with 'vmstat' or 'iostat' or
    most tools are the average across all CPUs.

    Now let's say we take this exact same workload (same 6 programs
    with same behavior) to another machine that has 6 CPUs (same
    CPU speeds and same I/O subsystem). Now each program can be
    running on its own CPU. Therefore, the CPU usage breakdown
    would be as follows:

             cpu %usr %sys %wio %idle
              0 25 5 70 0
              1 25 5 70 0
              2 25 5 70 0
              3 25 5 70 0
              4 100 0 0 0
              5 100 0 0 0
              - 50 3 47 0

    So now the average CPU utilization will be 50% user, 3% sys,
    and 47% iowait. Notice that the same workload on another
    machine has more than double the iowait value.
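
    The averages in the two tables can be checked with a short script.
    This is a sketch that takes the per-CPU rows exactly as printed above
    and averages each column; the function name average_cpu_states is
    invented, and rounding to whole percentages is an assumption made to
    match the tables.

```python
def average_cpu_states(per_cpu):
    """Average (usr, sys, wio, idle) rows across CPUs, to whole percent."""
    cpu_count = len(per_cpu)
    return tuple(round(sum(column) / cpu_count) for column in zip(*per_cpu))

# Per-CPU rows copied from the two tables in Example 4.
four_way = [(50, 10, 40, 0), (50, 10, 40, 0), (100, 0, 0, 0), (100, 0, 0, 0)]
six_way = [(25, 5, 70, 0)] * 4 + [(100, 0, 0, 0)] * 2

print("4 CPUs:", average_cpu_states(four_way))
print("6 CPUs:", average_cpu_states(six_way))
```

    The workload is identical in both cases; only the number of CPUs the
    idle-with-I/O time is spread over has changed.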

    Conclusion:

    The iowait statistic may or may not be a useful indicator of
    I/O performance - but it does tell us that the system can
    handle more computational work. Just because a CPU is in
    iowait state does not mean that it can't run other threads
    on that CPU; that is, iowait is simply a form of idle time.

    Jason de la Fuente

