RE: FreeBSD 4.11 P13 Crash



I've already ordered a new CPU and power supply. After installing those
parts, I hope the problem goes away. I'd bet the power supply is the more
likely cause, since someone else already mentioned it as a big culprit.

If it still fails after those two changes, then I can consider the
downgrade. I figured my setup can't be that unusual so someone else would
have run into this issue if it was indeed a software bug. Furthermore, I am
biased towards FreeBSD servers. They just aren't buggy beasts by nature!
:)

I don't think it is cooling, since the system's temperature has stayed
about the same. I'll take it into consideration though, as anything is
possible at this point.

Thanks for the other tips and notes. It's good to have some solid answers!



- Carroll Kong

-----Original Message-----
From: Peter Jeremy [mailto:peterjeremy@xxxxxxxxxxxxxxxx]
Sent: Tuesday, February 28, 2006 1:31 PM
To: Carroll Kong
Cc: hackers@xxxxxxxxxxx
Subject: Re: FreeBSD 4.11 P13 Crash

On Mon, 2006-Feb-27 20:52:57 -0500, Carroll Kong wrote:
> Okay this time my kernel was recompiled so there are no modules to make
> it easier to see all of the symbols.

If you cd to your kernel build directory (eg
/usr/src/sys/compile/DAEMON) and run 'make gdbinit' and then
use kgdb in that directory, there are a number of functions
to let you load KLD symbols.

> Sometimes the box cycles through the fatal traps 12. Other times it
> does not.
...
> This box was stable before I upgraded from 4.9->4.11.

It's always possible that you've hit a software bug. Would
it be practical to downgrade to your 4.9 configuration and
see if the problem goes away?
[Note that this does not totally rule out hardware as the
changed memory footprint may reveal a hardware problem].

> I have since swapped the RAM, motherboard, RAM again (I bought another
> stick thinking maybe my new RAM was coincidentally bugged), one of the
> Intel NICs, and my 3Ware controller. The problem still occurred and
> actually more frequently. The usual frequency was about 14 days or so.
> It just crashed in less than 23 hours and then again within 25 minutes.

Assuming a similar system load[*], this does suggest failing hardware.

My suspicions would be system cooling or PSU. Your P4 should
just throttle back if it gets too warm but other parts of
your system (RAM, northbridge, southbridge etc) may start
mis-behaving if they get too warm.

> - PowerSupply (I suppose anything is possible, please note it is on an
> APC UPS, but the power supply might be delivering bad juice?)

I'd put this as the likely culprit - consumer-grade PSUs are
not conservatively rated and modern systems put quite a
strain on the power supplies (in terms of very high dI/dt loads).

> year in the past. As a note, the problem is NOT load related. In
> fact, one time the fatal panic said the running process was
> "idle". :)

[*] A corrupted word in memory can sit around for a
relatively long time before something de-references it. A
lot of packet handling code exists at interrupt level and so
will only trigger when a packet arrives - even if the system
is otherwise idle.

--
Peter Jeremy
