Re: Sar questions
From: Tony Lawrence (apl_at_shell01.TheWorld.com)
Date: 12/24/03
- Previous message: Jeff Liebermann: "Re: Sar questions"
- In reply to: Chalawal Maliwan: "Sar questions"
- Next in thread: Jeff Liebermann: "Re: Sar questions"
- Reply: Jeff Liebermann: "Re: Sar questions"
Date: Wed, 24 Dec 2003 18:13:47 +0000 (UTC)
Chalawal Maliwan <chalawal@hotmail.com> wrote:
>Below is my sar output. How can we know that what caused the cpu to
>consume 100% CPU at all time?
>#sar
>00:00:00    %usr    %sys    %wio   %idle (-u)
>01:00:00      25      75       0       0
>02:00:00      25      75       0       0
>03:00:00      24      76       0       0
>04:00:00      24      76       0       0
>05:00:00      24      76       0       0
>06:00:00      24      76       0       0
Usually fairly simply. From http://aplawrence.com/Unixart/slow.html:
If it is the cpu that is pegged busy, it *may* be a runaway process
that is eating cpu cycles. Do this:
for x in 1 2 3 4 5
do
    # sort on the third field (TIME), highest first;
    # older sorts spell this "sort -r +2"
    ps -e | sort -r -k 3 | head -5
    echo "==="
    sleep 5
done
Look for a process whose TIME column has gone up by 3 to 5 seconds each
time. If you have something like that, that's your problem, and you need
to kill it. The TIME column is time on the cpu. Normally a process doesn't
spend a great deal of time actually running: it's waiting for the disk,
waiting for you to type something, etc. Most processes spend most of
their time sleeping, waiting for something else to happen, so something
that gains 3 seconds or more in 5 seconds of wall time is usually suspect.
If you watch it over a few minutes, the time it gains divided by the
elapsed wall clock time is the fraction of your cpu this process is
taking for itself (multiply by 100 for a percentage). A short-lived
process can take a lot of the cpu to print, or to redraw an X screen etc.,
so you have to use some good judgement here. But 3 seconds out of 5 is
very likely a real problem.
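To put a number on that, the arithmetic is just cpu seconds gained over
wall clock seconds elapsed. A minimal sketch, assuming a POSIX shell with
$(( )) arithmetic (an old Bourne shell would need expr instead); the
function name cpu_pct is made up for illustration:

```shell
# cpu_pct GAINED ELAPSED
# GAINED  = seconds the TIME column went up between two looks
# ELAPSED = wall clock seconds between those two looks
# Prints the share of one cpu as a whole-number percentage.
cpu_pct() {
    if [ "$2" -le 0 ]; then
        echo "elapsed time must be positive" >&2
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}

cpu_pct 3 5       # 3 seconds gained in 5 seconds of wall time
cpu_pct 45 180    # 45 seconds gained over 3 minutes
```

The first call prints 60 and the second 25. The division is integer
division, so small shares round down.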
Of course you need to understand what you are killing: you probably
wouldn't want to kill the main Oracle database, for example.
If you kill the errant process and another copy of it pops right back
to the top of the list, then you need to track down its parent:
# for example, if process 15246 is the problem
ps -p 15246 -o ppid
Of course, it may go further up the chain. Here's a script that traces
back to init:
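# This works on SCO or Linux, just pass a process ID as an argument.
MYPROC=$1
NEXTPROC=$MYPROC
# stop at PID 0, or when ps returns nothing because the process has gone
while [ -n "$NEXTPROC" ] && [ "$NEXTPROC" != 0 ]
do
    ps -lp $NEXTPROC
    MYPROC=$NEXTPROC
    # "ppid=" suppresses the header; tr strips the padding ps adds
    NEXTPROC=`ps -p $MYPROC -o "ppid=" | tr -d ' '`
done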
Sometimes you'll have a badly written network program that starts sucking
resources when its client dies. If you can't get the supplier to fix it,
you may want to write a script to track down and kill these things. One
clue that might help: the difference between a good "xyz" process and a
bad one might just be whether or not it has an attached tty. So, if you
see this:
5821 ? 00:00:42 xyz
6689 ttyp0 00:00:08 xyz
7654 ttyp1 00:00:12 xyz
It's probably the one with a "?" that will start accumulating time. So
a script that watched for and killed those might look like this:
# turn off shell filename expansion because of "?"
set -f
ps -e | grep "xyz$" | while read line
do
    set $line
    # $1 is the PID, $2 the tty; "?" means no attached tty
    [ "$2" = "?" ] && kill -9 $1
done
If you can't do it that way, you have to get more clever, and watch for
changing time:
set -f
mkdir -p /tmp/mystuff
ps -e | grep "xyz$" | while read line
do
    set $line
    ps -p $1 > /tmp/mystuff/first
    # adjust sleep as necessary
    sleep 5
    ps -p $1 > /tmp/mystuff/second
    # diff fails when the two snapshots differ, i.e. TIME has moved
    diff /tmp/mystuff/first /tmp/mystuff/second > /dev/null || kill -9 $1
done
And even that may not be clever enough for your particular situation,
so test and tread carefully. You may even need to do math on the time
field to see what has really happened.
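For the math, the TIME field first has to become plain seconds. Here is a
rough sketch, assuming TIME looks like mm:ss, hh:mm:ss or dd-hh:mm:ss
(the formats I'd expect from ps; check what yours prints). The name
time_to_secs is hypothetical:

```shell
# time_to_secs TIME
# Converts a ps TIME value ([dd-][hh:]mm:ss) to a number of seconds.
time_to_secs() {
    echo "$1" | awk '{
        days = 0
        t = $1
        i = index(t, "-")               # optional leading "dd-"
        if (i > 0) {
            days = substr(t, 1, i - 1)
            t = substr(t, i + 1)
        }
        n = split(t, f, ":")            # hh, mm, ss (or just mm, ss)
        secs = 0
        for (k = 1; k <= n; k++)
            secs = secs * 60 + f[k]
        print days * 86400 + secs
    }'
}

# seconds gained between two samples of the same process
first=`time_to_secs 00:00:42`
second=`time_to_secs 00:00:46`
echo $(( $second - $first ))
```

With the two sample values above this prints 4. A watcher script could
then kill only when the gain crosses whatever threshold suits your
situation, rather than on any change at all.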
Bela Lubkin made an interesting post about an apparently slow CPU2 on
an SMP system. Read it at http://aplawrence.com//Bofcusm/1695.html.
Another thing you may see is a process that has used a lot of time
but isn't gaining time right now. I've seen that many times where the
process is "deliver", MMDF's mail delivery agent on SCO systems that
aren't running sendmail. What happens is that for whatever reason
(a root.lock file from a crash in /usr/spool/mail or a missing "sys"
home directory), there are thousands of undelivered messages in the
subdirectories of /usr/spool/mmdf/lock/home.
The fix for that is simple if you don't care about the messages: rm -r
all those directories and recreate them empty with the same ownership
and permissions:
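# don't run rm -r * anywhere else if the cd fails
cd /usr/spool/mmdf/lock/home || exit 1
/etc/rc2.d/P86mmdf stop
# remember the directory names so they can be recreated empty
# (assumes this directory holds only the queue subdirectories)
DIRS=`ls`
rm -r *
mkdir $DIRS
chown mmdf:mmdf *
chmod 777 *
cd /usr/spool/mail || exit 1
rm *.lock
/etc/rc2.d/P86mmdf start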
You'd then want to verify that mail is working normally and that whatever
caused the problem isn't still happening. For example, if /usr/sys is
missing, this problem will come right back again very quickly.
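Before deleting anything, and again afterwards, it may be worth counting
what is actually sitting in the queue. A small sketch; count_files and
queue_report are names I've made up, and it assumes a find that
understands -type f:

```shell
# count_files DIR: number of regular files anywhere under DIR
count_files() {
    find "$1" -type f -print 2>/dev/null | wc -l | tr -d ' '
}

# queue_report DIR: one "count directory" line per subdirectory of DIR
queue_report() {
    for d in "$1"/*
    do
        if [ -d "$d" ]; then
            echo "`count_files "$d"` $d"
        fi
    done
}

queue_report /usr/spool/mmdf/lock/home
```

If the counts start climbing again soon after the cleanup, whatever
caused the backlog is still there.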
Another possibility is a program that is rapidly spawning off other
programs. You should be able to see that in "ps -e". First, is the
number of processes growing?
ps -e | wc -l
sleep 5
ps -e | wc -l
Or, are there new processes briefly showing up at the end of the listing?
ps -e | tail
sleep 5
ps -e | tail
In either case, you need to track down the parent and kill it.
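One way to find that parent is to count how many processes each parent
PID currently owns. This is only a sketch: count_by_ppid is a made-up
name, and very short-lived children can come and go between samples, so
you may need to run it several times.

```shell
# count_by_ppid: reads parent PIDs on stdin, one per line, and prints
# "count ppid" lines with the busiest parent first
count_by_ppid() {
    awk 'NF { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# sample input: three children of 6689, two of init
printf '1\n6689\n6689\n1\n6689\n' | count_by_ppid

# against the live system:
#   ps -e -o ppid= | count_by_ppid | head -5
```

The sample prints "3 6689" and then "2 1". On a live system init will
usually have plenty of children, so look for the parent whose count keeps
growing between runs.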
--
tony@aplawrence.com
Unix/Linux/Mac OS X resources: http://aplawrence.com
Get paid for writing about tech: http://aplawrence.com/publish.html