Re: Hot Swapping CPUs?

From: Dan Foster (dsf_at_globalcrossing.net)
Date: 01/03/04

  • Next message: Doug Barton: "Re: kernel with bge doesn't compile (undefined reference to `vlan_tag_alloc')"
    Date: Sat, 3 Jan 2004 01:03:07 +0000
    To: Scott W <wegster@mindcore.net>
    
    

    Hot Diggety! Scott W was rumored to have written:
    >
    > Some of the higher end IBM x86 systems are supposed to be able to do
    > this, although note that they are all systems equipped with integrated
    > (or additional) service processors (AKA Remote Supervisor Adapters).
    > Some of the service processor setups can be accessed via serial or rs485
    > management ports, and their monitors(CPU, Mem, disk status, temps, fans,
    > voltages) are monitored as well via IBM Director (software).

    I should point out that you can gather that data today with a utility
    called 'xmbmon' (or one of a few similar tools), which reads the
    information over the SMBus if the utility knows how to talk to the
    motherboard chipset, and especially if the board has sensors (LM75, LM78,
    etc).

    Modern motherboards (anything made in the past 2-3 years, at least) tend
    to be capable of talking with xmbmon. I've got it working great: I wrote a
    Nagios plug-in that reads the data (temp, fan, power, etc) from the
    motherboard, and if the Nagios server detects that any parameter is outside
    an acceptable range, it raises an alarm. (How you want the system to
    respond to an alarm is customizable, too... we like ours to shut down
    upon a high temp or voltage out of range alarm.)
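
    A minimal sketch of the kind of range check such a plug-in could perform,
    in Python. This is not the plug-in described above: the sensor names, the
    limits, and the check() function are all hypothetical. Only the exit codes
    (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plug-in
    convention.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical acceptable ranges, name -> (low, high).
# These values are illustrative only; tune them per motherboard.
LIMITS = {
    "temp0": (5.0, 60.0),      # degrees C
    "fan0": (1500.0, 8000.0),  # RPM
    "v12": (11.4, 12.6),       # volts
}


def check(readings, limits=LIMITS):
    """Return (exit_code, message) for a dict of sensor readings."""
    problems = []
    for name, (low, high) in sorted(limits.items()):
        if name not in readings:
            # A sensor we expected is missing: report UNKNOWN, not OK.
            return UNKNOWN, "UNKNOWN - no reading for %s" % name
        value = readings[name]
        if not low <= value <= high:
            problems.append("%s=%s outside [%s, %s]" % (name, value, low, high))
    if problems:
        return CRITICAL, "CRITICAL - " + "; ".join(problems)
    return OK, "OK - all sensors in range"
```

    A real plug-in would print the message and exit with the returned code,
    which is how Nagios learns the state of the check.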

    The typical situations that would cause an alarm are: a fan fails, causing
    internal temp to climb... OR... the HVAC system loses power, causing room
    temp to spike through the roof. Either way, we want the system to shut
    down quickly and sanely before the silicon melts down, which is much
    more time consuming to recover from. (Imagine -- it dies on a Sunday night;
    you don't have a spare CPU/RAM/motherboard/HD/etc... what do you do?)

    This is what happened during the large blackout on the U.S. East Coast last
    year... the systems stayed on because the room was on an industrial UPS...
    BUT... the HVAC system was not, so the room temp went to 125 degrees
    Fahrenheit or hotter.

    http://www.nt.phys.kyushu-u.ac.jp/shimizu/download/xmbmon200.tar.gz

    (2.03 is also in /usr/ports/sysutils/xmbmon)

    For our 4.9-RELEASE production boxes, we compiled mbmon from that package
    (not the xmbmon portion; we don't need the X interface) and added to the
    kernel config file:

    device smbus
    device iicbus
    device iicbb
    device viapm
    device smb

    (Note: smb above relates to SMBus, not SMB like the Windows file stuff :)
    SMBus = System Management Bus.)

    Built kernel, rebooted, mbmon worked great out of the box. Eg:

    # mbmon -c1

    Temp.= 31.2, 33.0, 24.2; Rot.= 3590, 0, 0
    Vcore = 1.75, 1.18; Volt. = 3.33, 5.20, 11.95, 0.00, 0.00

    (Not all motherboards will report all parameters; we have other servers
    with a different motherboard that report different numbers...)
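
    For anyone who wants to feed those numbers into a script, here is a rough
    Python sketch that parses the two-line 'mbmon -c1' summary shown above.
    It assumes the output always has the "label = v1, v2, ...; label = ..."
    shape of that sample; parse_mbmon() and SAMPLE are names made up for this
    example, and other mbmon versions or boards may print something different.

```python
# Sample text copied from the 'mbmon -c1' output quoted in this message.
SAMPLE = """\
Temp.= 31.2, 33.0, 24.2; Rot.= 3590, 0, 0
Vcore = 1.75, 1.18; Volt. = 3.33, 5.20, 11.95, 0.00, 0.00
"""


def parse_mbmon(text):
    """Parse the summary into {label: [float, ...]}."""
    readings = {}
    for line in text.splitlines():
        # Each line holds one or more "label = values" groups split by ';'.
        for group in line.split(";"):
            if "=" not in group:
                continue  # skip blank or unrecognized fragments
            label, _, values = group.partition("=")
            # Normalize "Temp." / "Volt. " to "Temp" / "Volt".
            label = label.strip().rstrip(".")
            readings[label] = [float(v) for v in values.split(",")]
    return readings
```

    With the sample above, parse_mbmon(SAMPLE)["Temp"] gives the three
    temperature readings as floats, and "Rot", "Vcore" and "Volt" work the
    same way.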

    I just wanted to mention all of the above stuff because some folks asked
    about temp/fan/voltage monitoring -- existing tools can already do that.

    However, monitoring the *failure* of a CPU (for other than a temp or
    voltage issue) is a much more interesting problem, and one that I don't
    think mbmon is really geared to deal with.

    IBM POWER4 servers do so through a service processor that actively snoops
    all CPU and memory transactions to determine when a CPU has died, and then
    takes a failed CPU out of service along with generating error reports for
    immediate notification. The way it's done results in continued uptime,
    which is why it's done that way. I don't know how the high end x86 servers
    handle that, but the expensive servers from Compaq, years ago, had some
    sort of similar features for the CPU.

    I just don't know how one would interface with the hardware to obtain that
    information... it's likely to be proprietary or hardware-specific, since I
    don't think there's a standard across vendors for this.

    -Dan
    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

