Re: AlphaServer 1200 powering off by itself.




james <at> hal-pc.org wrote:
Jack Estes wrote:
dissipates. Do you ever get any messages on the OCP besides the
"powerup failure"?

no. well... when it works i get "FreeBSD owns Tux" :) (i set the
variable for it)

to fix this one. The obvious questions like "did you check to see if
the fans work" or "is hot air coming out of the back of the machine" or
"do you smell burning insulation" I'm going to assume you've already
asked and answered. Did you do a "cat el" from the SRM console to see
if any errors are trapped there?

yes, yes, cute, and no
that's a good idea. i didn't know if there was any data that was
available via the SRM about failures and such. i'll certainly check
that tonight.

find hardware faults. By any chance, do you see anything funky in
/var/adm/messages or your syslog from around the times the box conks

nothing unusual.
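
For anyone repeating this check, here is a minimal sketch of pulling out only the log lines from the few minutes before a known power-off, so they can be read closely. It assumes the classic syslog timestamp prefix ("Mon DD HH:MM:SS", no year) and an English locale. The function name, the sample lines, the host name, and the year are all made up for illustration; none of them come from the machine in this thread.

```python
from datetime import datetime, timedelta

def lines_before(log_lines, crash_time, minutes=10):
    """Return the syslog lines stamped within `minutes` before crash_time."""
    window_start = crash_time - timedelta(minutes=minutes)
    hits = []
    for line in log_lines:
        stamp = line[:15]  # e.g. "Sep 13 21:12:30"
        try:
            # Classic syslog stamps carry no year, so borrow the crash year.
            when = datetime.strptime(f"{crash_time.year} {stamp}",
                                     "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if window_start <= when <= crash_time:
            hits.append(line)
    return hits

# Invented sample data, for illustration only.
sample = [
    "Sep 13 21:00:00 host kernel: sample line A",
    "Sep 13 21:12:30 host kernel: sample line B",
    "not a syslog line",
    "Sep 13 21:20:00 host kernel: sample line C",
]
crash = datetime(2005, 9, 13, 21, 15, 0)  # hypothetical power-off time
for hit in lines_before(sample, crash):
    print(hit)
```

In practice the list would come from reading /var/log/messages (or /var/adm/messages on Tru64) instead of the sample.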

Given the machine you have with a couple of EV6 cpus and not a ton of
memory, there aren't a lot of parts you're going to have to pull and
check but you mention you have seven 18GB drives in the machine. What
specific make and model controller do you have the cage hooked up to?

from dmesg:
isp0: <Qlogic ISP 1020/1040 PCI SCSI Adapter> port 0x1ffff00-0x1ffffff mem 0x7def000-0x7deffff at device 0.0 on pci2
isp0: interrupting at IRQ 0x4 (vec 0xbc0)

from pciconf:
isp0@pci2:0:0: class=0x010000 card=0x00000000 chip=0x10201077 rev=0x02 hdr=0x00
    vendor   = 'QLogic Corporation'
    device   = 'QLA1020/104x Fast-Wide-SCSI "Fast!SCSI IQ" Host Adapter'
    class    = mass storage
    subclass = SCSI

fairly "stock" AFAIK
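
As a side note for readers matching the controller against a PCI ID list: the chip= field in the pciconf line packs the device ID in the upper 16 bits and the vendor ID in the lower 16 bits. A small sketch of splitting it (the helper name split_pci_id is made up for illustration, and the input is the pciconf line quoted above with its wrapped tail rejoined):

```python
import re

def split_pci_id(line):
    """Return (vendor_id, device_id) from a pciconf -l style line."""
    match = re.search(r"chip=0x([0-9a-fA-F]{8})", line)
    if match is None:
        raise ValueError("no chip= field found")
    chip = int(match.group(1), 16)
    vendor_id = chip & 0xFFFF          # low 16 bits
    device_id = (chip >> 16) & 0xFFFF  # high 16 bits
    return vendor_id, device_id

line = ("isp0@pci2:0:0: class=0x010000 card=0x00000000 "
        "chip=0x10201077 rev=0x02 hdr=0x00")
vendor_id, device_id = split_pci_id(line)
print(f"vendor=0x{vendor_id:04x} device=0x{device_id:04x}")
# prints: vendor=0x1077 device=0x1020
```

0x1077 is QLogic's PCI vendor ID and 0x1020 is the ISP1020, which agrees with the vendor and device strings pciconf printed.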

load isn't too bad. Is this a recently configured machine or has it
been running a while and you're just now having problems with it? If
the former, we need to look at software too, if the latter, probably
it's some hardware thing. But you knew that too.

it's had the issue since i got it. i don't know when the problem could
have started (may be why they were tossing 20 machines). this OS is a
direct image of my "working" system. i didn't feel like wasting time
re-installing/compiling everything that I already put on my other box so
i booted to single user and "dd if=/dev/da0 of=/dev/da6" then (2 days
later) moved it over to the other and it booted. issues still
remaining, i swapped drives and the "working" machine still works and
the "non-working" still doesn't.

Next, what SRM firmware revision are you running? AND...if you've
updated it yourself since you acquired the box, do you recall if ARC
and any options were updated too? Was this machine EVER set up to run NT
(unlikely)? I know this sounds like I'm fishing and just like to read
my own Usenet posts but I'm going somewhere with this.

i'll check the version tonight. i did update it on all the machines
after a few go-arounds with odd issues like this and them being so
picky about ram. didn't make any difference with the issues i was
hoping to resolve.

My gut tells me that you have a power supply with a faulty rectifier
circuit so when power requirements for I/O and the CPUs increase, the
voltage regulator overloads. Next on my list, you may have a
firmware/driver/OS combination that's allowing misinterpretation of
sensor data and triggering a hardware-initiated shutdown (we used to
call that a "scram" when referring to the shiny-blinky things that
broke regularly in the jet I flew in the Navy). OR, you have a failing

certainly possible! i've learned that it's not so easy to rule things
out. in another post i mentioned plans to swap power supplies. thanks.

CPU or slot.

swapped CPUs and pulled one, with the same results

you had hwmgr, I could ask you to see if any components were being
indicted but my FreeBSD is so rusty I don't know if there's an
equivalent utility.

nothing i know of. there's some simple tools like pciconf that show the
attached devices on the bus and tools for scsi and ata controllers, but
not much in the way of system monitoring outside SMART or ACPI.

Have you tried pulling the CPUs and testing each CPU by itself by
locating it in slot 0 and running the machine? The last recommendation

i haven't tried a /different/ CPU (ie. from another machine), but as
mentioned earlier, neither of the 2 by itself make any difference.

I have comes from something I saw once on an ES40 that I just simply
couldn't believe. Do a nice clean shutdown of the machine if it's
running, disconnect the power cable(s) and pull the power supply. Blow
it out with compressed something (preferably not freon or hairspray,
etc.), blow out the chassis cage, then clean the contacts of both the
supply and the chassis connector slot with contact cleaner spray or
isopropyl alcohol, dry, and reseat the supply, then power up. It's kind
of one of those "can't see the forest for the trees" solutions but
after seeing our HP field service guy replace the motherboard, the PCI
backplane, two CPUs, the memory daughtercard, and a couple of DIMMs
before investing 30 seconds and two cents worth of air to fix the
hardware, I can't dismiss the solution.

good point. i did a pretty good cleaning job when i got them, but i
didn't clean the powersupply connectors with any solution. they were
fairly clean when i got them, just hit it through the drives, mobo,
grill, and powersupplies with an aircompressor.

Check, clean, and reseat the Molex connectors to the fans and the power
connector to the drive cage.

Molex? on the fans? they're attached to the mobo (3 wires). out of
curiosity i did try to power it on one time w/o the fan connected and it
certainly let me know what the problem was :). trying to see if it may
be sticking or something...
tried it with only one drive in there as well, but i'll reseat the Molex
connectors to the drives.

Good luck and if you get it resolved, please post your fix here or to

i'll update with my results. thanks for the very insightful, and
thorough ;), response.

--
- - james <at> hal-pc.org - -
"To insist on strength is not war-mongering.
It is peace-mongering."
- Barry Goldwater
- - - - - - - - - - - - - - -

Jack certainly has some excellent ideas and he's more current on the
Alphas than I (unfortunately) am (stuck with HP-UX nowadays). But with
three of these things conking out in almost the same way, I'm having a
bad time believing they could all have bad power supplies.

What's the environment like (temperature, humidity) where you have 'em
running?

Charles R. Whealton
Charles Whealton @ pleasedontspam.com
