Re: Sar questions

From: Tony Lawrence (apl_at_shell01.TheWorld.com)
Date: 12/24/03

    Date: Wed, 24 Dec 2003 18:13:47 +0000 (UTC)
    
    

    Chalawal Maliwan <chalawal@hotmail.com> wrote:
    >Below is my sar output. How can we know that what caused the cpu to
    >consume 100% CPU at all time?

    >#sar

    >00:00:00 %usr %sys %wio %idle (-u)
    >01:00:00 25 75 0 0
    >02:00:00 25 75 0 0
    >03:00:00 24 76 0 0
    >04:00:00 24 76 0 0
    >05:00:00 24 76 0 0
    >06:00:00 24 76 0 0

    Usually fairly simple. From http://aplawrence.com/Unixart/slow.html

    If it is the cpu that is pegged busy, it *may* be a runaway process
    that is eating cpu cycles. Do this:

    for x in 1 2 3 4 5
    do
    # sort on the third field (TIME), largest first; older sorts spell this "+2"
    ps -e | sort -r -k 3 | head -5
    echo "==="
    sleep 5
    done

    Look for a process whose TIME column has gone up by 3 to 5 seconds each
    time- if you have something like that, that's your problem- you need to
    kill it. The TIME column is time on the cpu- normally a process doesn't
    spend a great deal of time actually running- it's waiting for the disk,
    waiting for you to type something, etc. Most processes spend most of
    their time sleeping, waiting for something else to happen, so something
    that gains 3 seconds or more in 5 seconds of wall time is usually suspect.

    If you watch it over a few minutes, the time it gains here divided by
    the elapsed wall clock time is the percentage of your cpu this process
    is taking for itself. A short-lived process can take a lot of the cpu
    to print, or to redraw an X screen etc., so you have to use some good
    judgement here. But 3 seconds out of 5 is very likely a real problem.
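
    The arithmetic described above can be sketched as a pair of shell
    functions. This is only an illustration: the names to_secs and cpu_share
    are made up for this example, and it assumes your ps prints TIME as
    MM:SS, HH:MM:SS or D-HH:MM:SS and that a POSIX shell and awk are
    available.

```shell
# to_secs TIME: convert a ps TIME value (MM:SS, HH:MM:SS or D-HH:MM:SS)
# to a number of seconds. (Hypothetical helper, not part of ps.)
to_secs() {
    echo "$1" | awk -F'[-:]' '{
        # multipliers for seconds, minutes, hours, days, counted from the right
        split("1 60 3600 86400", m, " ")
        s = 0
        for (i = 0; i < NF; i++)
            s += $(NF - i) * m[i + 1]
        print s
    }'
}

# cpu_share FIRST SECOND ELAPSED: percentage of one cpu used between two
# TIME samples taken ELAPSED wall-clock seconds apart.
cpu_share() {
    t1=$(to_secs "$1")
    t2=$(to_secs "$2")
    echo "$t1 $t2 $3" | awk '{ printf "%d\n", ($2 - $1) * 100 / $3 }'
}
```

    For example, "cpu_share 00:00:42 00:00:45 5" prints 60: three seconds
    of cpu gained in five seconds of wall time.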

    Of course you need to understand what you are killing: you probably
    wouldn't want to kill the main Oracle database, for example.

    If you kill the errant process and another copy of it pops right back
    to the top of the list, then you need to track down its parent:

    # for example, if process 15246 is the problem
    ps -p 15246 -o ppid

    Of course, it may go further up the chain. Here's a script that traces
    back to init:

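    # This works on SCO or Linux; pass a process ID as the argument.
     NEXTPROC=$1
     # Stop at PID 0, or when a process has exited and ps prints nothing.
     while [ -n "$NEXTPROC" ] && [ "$NEXTPROC" != 0 ]
     do
        ps -lp "$NEXTPROC"
        MYPROC=$NEXTPROC
        # "ppid=" suppresses the header; tr strips the padding ps adds
        NEXTPROC=`ps -p "$MYPROC" -o "ppid=" | tr -d ' '`
     done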

    Sometimes you'll have a badly written network program that starts sucking
    resources when its client dies. If you can't get the supplier to fix it,
    you may want to write a script to track down and kill these things. One
    clue that might help: the difference between a good "xyz" process and a
    bad one might just be whether or not it has an attached tty. So, if you
    see this:

      5821 ? 00:00:42 xyz
      6689 ttyp0 00:00:08 xyz
      7654 ttyp1 00:00:12 xyz

    It's probably the one with a "?" that will start accumulating time. So
    a script that watched for and killed those might look like this:

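    # turn off filename expansion so the "?" in the TTY column is not globbed
    set -f
    ps -e | grep "xyz$" | while read line
    do
    # split the ps line into $1 (PID), $2 (TTY), $3 (TIME), $4 (CMD)
    set -- $line
    [ "$2" = "?" ] && kill -9 "$1"
    done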
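
    Before letting a script like that loose with kill -9, it may be worth
    a dry run that only lists the candidates. A minimal sketch, assuming
    the four-column "ps -e" layout shown above (PID, TTY, TIME, CMD); the
    function name list_detached is made up for this example:

```shell
# list_detached NAME: read "ps -e" output on stdin and print the PID of
# every process whose command is NAME and whose TTY column is "?".
# Nothing is killed; this only reports. (Hypothetical helper.)
list_detached() {
    awk -v name="$1" '$2 == "?" && $NF == name { print $1 }'
}
```

    Run it as "ps -e | list_detached xyz" and check the PIDs it prints
    before passing any of them to kill.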

    If you can't do it that way, you have to get more clever, and watch for
    changing time:

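    set -f
    # -p: don't complain if the directory is left over from an earlier run
    mkdir -p /tmp/mystuff
    ps -e | grep "xyz$" | while read line
    do
    set -- $line
    ps -p "$1" > /tmp/mystuff/first
    sleep 5
    # adjust sleep as necessary
    ps -p "$1" > /tmp/mystuff/second
    # diff exits non-zero when the two snapshots (i.e. the TIME column) differ
    diff /tmp/mystuff/first /tmp/mystuff/second > /dev/null || kill -9 "$1"
    done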

    And even that may not be clever enough for your particular situation,
    so test and tread carefully. You may even need to do math on the time
    field to see what has really happened.
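
    One way to do that math is to require a minimum gain in cpu seconds
    rather than any change at all. What follows is only a sketch under
    some assumptions: TIME is printed as MM:SS, HH:MM:SS or D-HH:MM:SS,
    your ps accepts "-o time=", and awk supports user-defined functions
    (nawk or later). The names gained_at_least and is_hog are made up
    for this example.

```shell
# gained_at_least FIRST SECOND MIN: succeed if the TIME value SECOND is
# at least MIN seconds larger than FIRST. (Hypothetical helper.)
gained_at_least() {
    echo "$1 $2 $3" | awk '
        function secs(t,    p, n, i, m, s) {
            n = split(t, p, /[-:]/)
            m = 1; s = 0
            for (i = n; i >= 1; i--) {
                s += p[i] * m
                # seconds and minutes scale by 60, hours by 24
                m *= (i >= n - 1) ? 60 : 24
            }
            return s
        }
        { exit !(secs($2) - secs($1) >= $3) }'
}

# is_hog PID [INTERVAL] [MIN]: sample the TIME of PID twice, INTERVAL
# seconds apart (default 5), and succeed if it gained MIN or more
# seconds of cpu (default 3). Fails if the process has gone away.
is_hog() {
    first=$(ps -p "$1" -o time=) || return 1
    sleep "${2:-5}"
    second=$(ps -p "$1" -o time=) || return 1
    gained_at_least "$first" "$second" "${3:-3}"
}
```

    Something like "is_hog 15246 && echo 15246 looks busy" reports without
    killing anything, which is a safer first step than kill -9.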

    Bela Lubkin made an interesting post about an apparently slow CPU2 on
    an SMP system. Read it at http://aplawrence.com//Bofcusm/1695.html.

    Another thing you may see is a process that has used a lot of time
    but isn't gaining time right now. I've seen that many times where the
    process is "deliver"- MMDF's mail delivery agent on SCO systems that
    aren't running sendmail. What happens is that for whatever reason
    (a root.lock file from a crash in /usr/spool/mail or a missing "sys"
    home directory), there are thousands of undelivered messages in the
    subdirectories of /usr/spool/mmdf/lock/home.

    The fix for that is simple if you don't care about the messages: rm -r
    all those directories and recreate them empty with the same ownership
    and permissions:

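    cd /usr/spool/mmdf/lock/home
    /etc/rc2.d/P86mmdf stop
    # remove each directory and recreate it empty, so that the chown and
    # chmod below still have something to act on
    for d in *
    do
    rm -r "$d"
    mkdir "$d"
    done
    chown mmdf:mmdf *
    chmod 777 *
    cd /usr/spool/mail
    rm *.lock
    /etc/rc2.d/P86mmdf start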

    You'd then want to verify that mail is working normally and that whatever
    caused the problem isn't still happening- for example, if /usr/sys is
    missing this problem will come right back again very quickly.

    Another possibility is a program that is rapidly spawning off other
    programs. You should be able to see that in "ps -e". First, is the
    number of processes growing?

    ps -e | wc -l
    sleep 5
    ps -e | wc -l

    Or are new processes briefly showing up at the end of the listing?

    ps -e | tail
    sleep 5
    ps -e | tail

    In either case, you need to track down the parent and kill it.
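
    To get a hint about which parent that is, you can count how many
    children each parent currently has. This is a rough sketch, assuming
    a ps that accepts "-o ppid=" (as used in the trace-back script); the
    function name count_by_parent is made up for this example, and a
    single snapshot may miss very short-lived children, so run it a few
    times.

```shell
# count_by_parent: read one parent PID per line on stdin and print
# "count ppid" lines, the parent with the most children first.
# (Hypothetical helper.)
count_by_parent() {
    awk 'NF { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

    Feed it with "ps -e -o ppid= | count_by_parent | head -5" and look
    for a parent whose count is unusually high or keeps changing.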

    -- 
    tony@aplawrence.com Unix/Linux/Mac OS X  resources: http://aplawrence.com
    Get paid for writing about tech: http://aplawrence.com/publish.html
    
