SUMMARY: TruCluster Member Boot Failure

From: dorth (dorthensensens_at_hotmail.com)
Date: 06/30/03


Date: 30 Jun 2003 08:31:44 -0700

Thanks again to Tom Smith.

We were using MC 1.5 boards, one of which was defective. We
simply swapped in two MC2 boards and the cluster came up with no problem at
all.

dorthensensens@hotmail.com (dorth) wrote in message news:<1f0b6ae6.0306261002.dd29f13@posting.google.com>...
> Hi All,
> I really need help with this. I have set up the first node of a Tru64
> 5.1A cluster without any issue at all. There are to be two nodes only
> so I set up jumpers on the memory channel on node 1 to VH0 and on node
> 2 as VH1. I have also successfully set up a quorum disk. The systems
> are 2 8400's. One is populated with 8x625's and 16GB ram (node 1) the
> other with 4x440's and 5GB ram (node 2).
>
> I run clu_add_member on the first node (after I booted from the node
> specific boot disk) and it goes through the config wizard and tells me
> to boot my member 2 node with boot -file genvmunix <node 2 boot disk>.
> It starts the boot and then after:
>
> Starting CFS daemons
> Registering CFS Services
> Initializing CFSREC ICS Service
> Registering CFSMSFS remote syscall interface
> Registering CMS Services
>
> I get:
>
> cpu 2 halted
> halt code =7
> machine check while in PAL mode
> PC =1c898
>
> cpu 0 not halted
> cpu 1 not halted
> cpu 3 not halted
>
> CPU 02 unexpected machine check through vector 0670
> Processor machine check
>
> I've tried:
> Replacing the PCI shelf.
> Replacing SCSI cables on shared buses.
> Updating firmware.
> Switching PCI locations of the memory channel.
> Patching the OS (and rebuilding the cluster).
> Replacing memory/CPU/terminator boards.
> praying to all known and unknown religious beings.
> cursing.
>
> I CANNOT get this second node up regardless of my actions. What am I
> doing wrong? What causes this error always at the exact same
> location?
>
> I am now going to try to remove EVERYTHING from the PCI shelf except
> the memory channels and try yet once again.....
>
> Thank you


