Re: Interpreting program core dump in mdb



In article <13un7sf5fhq685a@xxxxxxxxxxxxxxxxxx>, "Mr. Uh Clem" <uhclem@xxxxxxxxxxxxxxxxxx> wrote:
At $DAY_JOB, we've got a customer who has installed our product on a
Solaris 10 SPARC system and is getting a mysterious segmentation
violation in one of our background processes. Of course, this problem
does not occur on any of our in-house systems.

We did get the customer to send us a core file, but aren't very handy
with the debug tools on Solaris.


# mdb prog core
Loading modules: [ libc.so.1 ld.so.1 ]
::stack
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
_start+0x108(0, 0, 0, 0, 0, 0)


I've googled up countless articles telling me that ::stack gets a
stack dump, but have yet to find one which tells me what the
values in the display **ARE**.


Some specifics on this one: It's a daemon process which accepts
a connection and forks off a worker process to handle the connection.
Early on, it calls secure() which is linked from a different .o file:


char user_name[USER_LENGTH + 1]; /* global in .c containing secure */


secure(host)
char *host;
{
    ....
    struct passwd *pw;
    ....

    pw = getpwuid(getuid());
    if (pw != NULL)
        strncpy(user_name, pw->pw_name, sizeof(user_name)-1);


We seem to blow up on trying to move the user name from pw->pw_name,
which is very strange given that pw is supposed to point to static
space allocated by getpwuid().
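
A minimal sketch of a more defensive copy, for comparison with the call above. The helper name copy_name is hypothetical and not part of the original program. It only guards against a NULL source or destination and forces termination; whether either condition is what actually triggers this crash is not known from the core alone. It also assumes the calling file includes <pwd.h> and <unistd.h>, so that getpwuid() and getuid() are properly declared rather than implicitly treated as returning int.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the original program.
 * Copies src into dst (capacity dstsize bytes) and always terminates it.
 * Returns 0 on success, -1 if any argument is unusable. */
int copy_name(char *dst, size_t dstsize, const char *src)
{
    if (dst == NULL || dstsize == 0 || src == NULL)
        return -1;

    /* Copy at most dstsize-1 bytes, then terminate explicitly so the
     * result is a valid string even when src is longer than dst. */
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';
    return 0;
}
```

In secure() this would be called as copy_name(user_name, sizeof(user_name), pw->pw_name) after the pw != NULL check, with a log message on a -1 return.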

struct passwd {
    char *pw_name;
    char *pw_passwd;
    uid_t pw_uid;
    gid_t pw_gid;
    char *pw_age;
    char *pw_comment;
    char *pw_gecos;
    char *pw_dir;
    char *pw_shell;
};


Understanding the context around the stack frame seems crucial. One
strange thing is that
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
contains 726f6f74, the ASCII bytes for "root", which should be in
memory at the address pointed to by pw_name...
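
To check that reading of 726f6f74, here is a small sketch that unpacks a 32-bit word into its four bytes, most significant first, which is the order they occupy in memory on big-endian SPARC. The function name word_to_text is made up for illustration.

```c
/* Hypothetical helper for illustration only.
 * Unpacks the low 32 bits of word into four characters, most
 * significant byte first (big-endian memory order), and terminates
 * the result. Non-printable bytes are shown as '.'. */
void word_to_text(unsigned long word, char out[5])
{
    int i;

    for (i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)((word >> (24 - 8 * i)) & 0xff);
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Passing 0x726f6f74 yields the bytes 0x72, 0x6f, 0x6f, 0x74, that is "root". So the value in the fourth slot looks like string data picked up during the copy, not a pointer.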


We're pretty sure we're doing Something Stupid(tm), but don't see
how we could muck up the static space returned by getpwuid between
the time the program starts and getting to this point. This is
code that has been running for quite a while on various Unix flavors
including Solaris 7 and upward. We now see that we have two
Solaris 10 customers with this problem. The code was compiled
under a Solaris 8 system.

So anyway, some pointers to interpreting the context around a crash
using mdb would be appreciated.

TIA

It would help if you built with debugging enabled, using the compiler's
-g option.

Eric