Re: AlphaServer 1200 powering off by itself.
- From: "Jack Estes" <jack.estes@xxxxxxxxx>
- Date: 15 Aug 2006 10:58:37 -0700
I'll warn you beforehand: Reading this post is going to cost you a few
minutes of your life that you'll never get back and if I haven't
helped, I'm sorry. Maybe it will be easier knowing that it took me
15-20 minutes to type this so I've lost more than you ;-)
FreeBSD 5.4 on the 1200 is usually a perfectly acceptable combination
and, as you alluded to with your single CPU host, "just works". Your
thermal hypothesis is a good one since the problems seem to occur after
events which would increase heat and resolve themselves after heat
dissipates. Do you ever get any messages on the OCP besides the
I won't wax too philosophic but while the simplest explanations are
usually correct, Holmes would remind us when that ISN'T the case, you
have to eliminate the impossible, so that whatever remains, however
improbable, must be the truth. I'm saying that while the simple answer
might be "heat", the answer doesn't really subtract much from the
differential diagnosis. In short, you might have to poke around a bit
to fix this one. The obvious questions like "did you check to see if
the fans work" or "is hot air coming out of the back of the machine" or
"do you smell burning insulation" I'm going to assume you've already
asked and answered. Did you do a "cat el" from the SRM console to see
if any errors are trapped there?
My immediate thought would be that thermal expansion of some component
is causing electrical contact loss or a short and when it cools,
contact is reestablished but you already knew that. Let's look a little
further to rule out the goofy stuff:
Since FreeBSD wasn't really written for the hardware but rather ported,
there isn't the granularity in error reporting compared to Tru64 or VMS
(at the O/S level, I mean) so the problem is going to be a little
harder to track down. With Tru64, as you probably know, we have stuff
like the binary error log and DECevent, hwmgr, webes, etc, etc, etc to
find hardware faults. By any chance, do you see anything funky in
/var/log/messages (/var/adm/messages on Tru64) or your syslog from around the times the box conks
out? I'm guessing no, since the fault probably occurs and kills power
before there's time to write anything to disk. Stuff like SCSI bus
resets or I/O errors are where I'm thinking because if the disk
controller or the disks themselves are supplied with erratic power,
they'll behave badly. Too bad electrons move so darn fast ;-)
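
If you'd rather not eyeball the whole log, here is a minimal sketch of the kind of search I mean. Treat all of it as assumption: the patterns are hypothetical guesses at what bus resets and I/O errors look like (the exact wording depends on the driver and OS release), and find_suspect_lines and scan_log are just names I made up for illustration.

```python
import re

# Hypothetical patterns: kernel message wording varies by driver and
# OS release, so adjust these to match what your own logs contain.
SUSPECT_PATTERNS = [
    re.compile(r"scsi.*reset", re.IGNORECASE),
    re.compile(r"bus reset", re.IGNORECASE),
    re.compile(r"i/o error", re.IGNORECASE),
    re.compile(r"timed? ?out", re.IGNORECASE),
]


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in SUSPECT_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, "r", errors="replace") as handle:
        return find_suspect_lines(handle)
```

Call scan_log() with the path to whatever log file your syslog actually writes to, then compare the timestamps on the hits against the times the box powered off. No hits doesn't clear the disks or controller, for the reason given above: the power may drop before anything reaches the disk.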
I did go through a similar experience with a 2100 configured with 2
Sable (EV5) CPUs and 2 GB of RAM, but with Debian Sarge, not FreeBSD.
The problem wound up being a faulty connector on the secondary cooling
fan and thermal sensor which was fixed with $0.0005 worth of solder and
10 minutes waiting for my iron to heat up. The symptoms were nearly
identical to those you describe but we're also talking about apples and
oranges because the 2100 is older and the hardware is much different so
I'll try to stay on topic.
Given the machine you have with a couple of EV6 cpus and not a ton of
memory, there aren't a lot of parts you're going to have to pull and
check but you mention you have seven 18GB drives in the machine. What
specific make and model controller do you have the cage hooked up to?
Those drives need a lot of power to start up so the controller usually
spins up just a couple at a time during boot but once they're running,
load isn't too bad. Is this a recently configured machine or has it
been running a while and you're just now having problems with it? If
the former, we need to look at software too, if the latter, probably
it's some hardware thing. But you knew that too.
Next, what SRM firmware revision are you running? AND...if you've
updated it yourself since you acquired the box, do you recall if ARC
and any options were updated too? Was this machine EVER set up to run NT
(unlikely)? I know this sounds like I'm fishing and just like to read
my own Usenet posts but I'm going somewhere with this.
My gut tells me that you have a power supply with a faulty rectifier
circuit so when power requirements for I/O and the CPUs increase, the
voltage regulator overloads. Next on my list, you may have a
firmware/driver/OS combination that's allowing misinterpretation of
sensor data and triggering a hardware-initiated shutdown (we used to
call that a "scram" when referring to the shiny-blinky things that
broke regularly in the jet I flew in the Navy). OR, you have a failing
CPU or slot. There's nothing wrong with running FreeBSD on the 1200 but
with Tru64 and VMS, the O/S and SRM communicate more and the O/S can do
stuff to "throttle" load related problems. The OLAR stuff for newer
Alpha hardware and Tru64 comes to mind but that won't help us here. If
you had hwmgr, I could ask you to see if any components were being
indicted but my FreeBSD is so rusty I don't know if there's an
Have you tried pulling the CPUs and testing each CPU by itself by
locating it in slot 0 and running the machine? The last recommendation
I have comes from something I saw once on an ES40 that I just simply
couldn't believe. Do a nice clean shutdown of the machine if it's
running, disconnect the power cable(s) and pull the power supply. Blow
it out with compressed something (preferably not freon or hairspray,
etc.), blow out the chassis cage, then clean the contacts of both the
supply and the chassis connector slot with contact cleaner spray or
isopropyl alcohol, dry, and reseat the supply then power up. It's kind
of one of those "can't see the forest for the trees" solutions but
after seeing our HP field service guy replace the motherboard, the PCI
backplane, two CPUs, the memory daughtercard, and a couple of DIMMs
before investing 30 seconds and two cents worth of air to fix the
hardware, I can't dismiss the solution.
Check, clean, and reseat the Molex connectors to the fans and the power
connector to the drive cage.
Good luck and if you get it resolved, please post your fix here or to
ITRC. Nice sig, by the way.
james <at> hal-pc.org wrote:
no. it's running FreeBSD 5.4, however it will power down if left at
the SRM for a couple of days too. i get "Powerup Failure" on the LCD
after it's cooled down (30+ minutes) and decides to boot again, but i'm
not quite sure what message logs you're referring to. does the SRM keep
logs? if so how may i access them?
sorry if this is OT, but i didn't see any groups just for DEC hardware.
i have an identical system, but single CPU, running a mirror image of
the OS (dd from one to another) that "just works", however if it's been
running for several days (3 or so) and i reboot it will often times
power off before loading the graphic console as well and won't power up
properly until it has sat for a while. that one has yet to power down
when running and it's been up for 6-8 months at a time.. i tossed
another 1200 because it wouldn't stay on for longer than 24 hrs before
powering down (that one was running CentOS). seems to be a problem with
the 1200 but i haven't been able to come up with any pointers googling.
My 2 AlphaPC 164lx machines work great (both of which have FBSD 5.4 on
any other ideas?
- - james <at> hal-pc.org - -
"No freeman shall ever be debarred the use of arms."
- Thomas Jefferson: Draft Virginia Constitution, 1776.
- - - - - - - - - - - - - - -