Re: FreeBSD Machines dieing, we've tried so much....

From: Matt Juszczak (matt_at_atopia.net)
Date: 06/22/05

    Date: Wed, 22 Jun 2005 11:59:12 -0400
    To: Ted Mittelstaedt <tedm@toybox.placo.com>
    
    

    >The vast majority of panics are hardware-related. It is rare nowadays
    >for a usermode program to make the system panic. In particular you said
    >the problem happens more under load. That really points even more to a
    >hardware problem - bad CPU cache ram, bad ram, scsi termination, that
    >sort of thing.
    >
    >Ted
    >
    >

    This is kind of going to be a blanket post to all the recent suggestions
    to me. I appreciate suggestions :) Ted, sorry, my other posts had
    dmesg and hardware specs, etc. I just couldn't remember the subject line
    of that thread. I'll be more descriptive here.

    We have two different servers crashing. Both are SMP, but on different
    hardware. We have five FreeBSD servers in total, and only two are
    affected. That is why I do not believe this is a hardware problem.

    In any case, the machines are in a cold room where the temperature is
    constantly maintained. Twenty other servers in there are perfectly
    stable, with no problems.

    The machine that crashed last night while running portsdb -uU, orion,
    is a Super Micro box with dual 3.06 GHz CPUs, 4 GB of memory, and
    hyperthreading disabled in the BIOS. We ran a memory test on orion a
    week or so ago, and it found 70,000 ECC errors. Those were fixed, and
    the machine had been stable until last night. I've now disabled SMP
    support; we'll see whether that keeps it stable. portsdb -uU ran
    without problems after I disabled SMP.

    As for uranus, the other box (we keep a planet naming scheme for a
    certain set of servers), we ran memtest86 and found no errors at all.
    That box crashed about two days ago but has been stable since. It has
    not lasted more than a week without hitting a kernel trap and freezing.

    It seems that both of these servers have this problem. Out of the five
    FreeBSD servers we have, these two carry the highest load. Maybe a
    higher load on the other three servers would cause the same problem. I
    would agree with you that this is a hardware problem, but seeing it on
    more than one server, with two different architectures and under our
    highest load, makes me reconsider.

    If this is truly a bug in FreeBSD 5.4-RELEASE, maybe it is something
    that has been fixed in -STABLE? I will compile a debug kernel today and
    try to provide a trace of the problem. I'll do it on whichever server
    crashes next.
    _______________________________________________
    freebsd-questions@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-questions
    To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"

