Re: Solaris 10 - catman weird error
From: Bryan Cantrill (bmc_at_kiowa.eng.sun.com)
Date: 12/07/03
- Next message: Brendan Gregg: "A tool to "snoop" I/O traffic"
- Previous message: Gavin Maltby: "Re: Setting 64k page size as default for _all_ applications..."
- In reply to: David Williams: "Re: Solaris 10 - catman weird error"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 7 Dec 2003 01:54:49 +0000 (UTC)
In article <bqj9j8$i8t$1@newsg2.svr.pol.co.uk>, David Williams wrote:
>
>"Bart Smaalders" <barts@smaalders.net> wrote in message
>news:5bd8e3f5.0311271128.77c6c550@posting.google.com...
>>
>> Gratuitous DTrace plug:
>>
>> Since you're running a DTrace capable build, here's a handy script for
>> figuring out which system calls are chewing up the most cpu...
>
> Question for you..
>
> Suppose I have a database server which occasionaly gets errno 5 (I/O
>error) or errno 9 (EBADF)
> returned from read/write/open calls but this is so intermittent that it
>cannot be reproduced.
> When uses use the system this error shows up in a few hours but checked
>the database for
> errors fixes things (with no errno 5/ errno 9 errors).
>
> Clearly running something like truss on the database server for all users
>will be far too slow.
>
> Could dtrace help? Something like if read/write/open returns I/O error or
>EBADF then
> log it (like the inode with the error, the offset etc.).
DTrace can absolutely help in this situation. For starters, you could simply
log the executable name and a user-level stack trace whenever a read or write
returns EIO or EBADF:
---8<--- logerrs.d ---8<---
#!/usr/sbin/dtrace -s
#pragma D option quiet
syscall::read:return,
syscall::write:return
/errno == EIO || errno == EBADF/
{
printf("%s returning errno %d with stack:\n", execname, errno);
ustack();
printf("\n");
}
---8<--- logerrs.d ---8<---
Of course, these are very different errors -- EBADF is almost certainly
due to an application bug, but EIO could be due to lots of different
problems.
Let's deal with the EBADFs in more detail: if you're getting EBADFs, you
probably want to know _how_ exactly the file descriptor was closed.
(Assuming, of course, that the EBADFs are failing because a file descriptor
is being accidentally closed, and not because someone is passing the
return value of lrand48() as the first argument to read/write.) The problem
of course is that the file descriptor was closed sometime in the past, and
the information about the close is only relevant to you if someone else
attempts a read or write on the closed file descriptor. This is exactly
the kind of situation for which we invented speculative tracing. Assuming
that you know the process ID of the problematic process, here is a D script
that will display the stack trace of both the close and the read/write:
---8<--- whoclosed.d ---8<---
#!/usr/sbin/dtrace -s
#pragma D option specsize=1k
#pragma D option nspec=8192
#pragma D option quiet
int fds[int];
syscall::close:entry
/pid == $1 && fds[arg0]/
{
/*
* We already have a stack trace corresponding to the close; we
* want to discard it.
*/
discard(fds[arg0]);
fds[arg0] = 0;
}
syscall::close:entry
/pid == $1/
{
/*
* Begin a new speculation, and speculatively trace the closing stack
* trace. Store the new speculation identifier in an associative
* array indexed by the file descriptor.
*/
fds[arg0] = speculation();
speculate(fds[arg0]);
printf("fd %d was closed with stack:\n", arg0);
ustack(100);
printf("\n");
}
syscall::read:entry,
syscall::write:entry
/pid == $1/
{
self->fd = arg0;
}
syscall::read:return,
syscall::write:return
/self->fd && errno == EBADF/
{
/*
* A read or write returned EBADF; record the user stack.
*/
printf("%s failed on fd %d with stack:\n", probefunc, self->fd);
ustack();
printf("\n");
}
syscall::read:return,
syscall::write:return
/self->fd && errno == EBADF && fds[self->fd]/
{
/*
* We have a speculatively traced user stack for the close of this
* file descriptor; commit it.
*/
commit(fds[self->fd]);
fds[self->fd] = 0;
}
syscall::read:return,
syscall::write:return
/self->fd/
{
self->fd = 0;
}
---8<--- whoclosed.d ---8<---
(This script is going to be a bit difficult to understand if you haven't
seen DTrace before; check out http://www.sun.com/bigadmin/content/dtrace
for the AnswerBook, and in particular read the "Variables", "Speculative
Tracing", and "Options and Tunables" chapters.)
Running the above, specifying a (contrived) process that closes a file
descriptor before writing to it:
# ./whoclosed.d `pgrep badwrite`
write failed on fd 10 with stack:
libc.so.1`_write+0x8
badwrite`main+0x40
badwrite`_start+0x108
fd 10 was closed with stack:
libc.so.1`_close+0x4
badwrite`main+0x2c
badwrite`_start+0x108
^C
Note that this script is more expensive than the simple logging script
(that is, it has a larger probe effect), but it will tell you _exactly_
who closed the file descriptor -- and it's much, much, much less invasive
than truss(1). And because it only provides exactly the output that
you need, you can use it for long-running processes. You also may
be interested in stopping a process when it is returned EBADF; see the
"Destructive Actions" section of the "Actions and Subroutines" chapter
of the AnswerBook.
- Bryan
----------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development. bmc@eng.sun.com (650) 786-3652
- Next message: Brendan Gregg: "A tool to "snoop" I/O traffic"
- Previous message: Gavin Maltby: "Re: Setting 64k page size as default for _all_ applications..."
- In reply to: David Williams: "Re: Solaris 10 - catman weird error"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|