SUMMARY: Restoring cluster from tape
- From: Andrew Raine <andrew.raine@xxxxxxxxxxxxxxxxxx>
- Date: Tue, 19 Sep 2006 10:15:16 +0100
Dear All,
I realise that I owe the list a summary...
I had helpful responses from Hasan Atasoy, Christopher Knorr and David G. Hasan, in particular, sent me a fantastic document describing the (many and complicated) things one needs to do to reconstruct a cluster on new hardware...
I subsequently realised that there is a good section in the "TruCluster Server Handbook" by Fafrak et al. (pub Digital Press) covering exactly the same process.
However, after about a week of effort on my part, I still didn't have a bootable cluster (my fault, not Hasan's or Fafrak et al.'s) and decided instead to concentrate on migrating all our users, data and applications to the Opteron system that we had already bought to replace the Alphas.
So, I now have a working Opteron/Linux system, and a decommissioned Tru64 system. I guess there is no longer any need for me to be subscribed to this list, so thanks once again to all of you who have helped in the past: this list is truly one of the best resources on the net!
Regards,
Andrew
PS: Original question:
Hi fellow-managers,
I have a 2-node cluster (DS20 + ES40 + HSG80) running 5.1 pk3
A RAIDset on the HSG died recently, which contained the cluster_root, cluster_usr, cluster_var, root1_domain and root2_domain AdvFS domains AND the quorum disk. (Bad planning, I now realise, but this was how the engineer set us up!)
I have installed a local copy of Tru64 on the ES40, and created a replica RAIDset on the HSG80 to which I have vrestored the relevant filesystems.
Of course the WWIDs of the new partitions are not the same as their original equivalents, so I have had to use wwidmgr at the SRM prompt to enable the new boot devices.
HOWEVER, the system still won't boot. Presumably this is at least partly because the new disks don't have the same /dev/disk/dsk??? names as the old ones. BUT the two nodes behave differently:
Node 1 (the ES40) says:
<hardware self-test stuff deleted...>
CPU 0 booting
(boot dga101.1001.0.6.1 -flags A)
block 0 of dga101.1001.0.6.1 is a valid boot block
reading 13 blocks from dga101.1001.0.6.1
bootstrap code read in
base = 200000, image_start = 0, image_bytes = 1a00
initializing HWRPB at 2000
initializing page table at 3ff7e000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code
root partition blocksize must be 8192
can't open osf_boot
halted CPU 0
halt code = 5
HALT instruction executed
PC = 20000030
P00>>>
While node 2 (the DS20) starts to go through what looks like a normal boot sequence, but then says:
alt0 at pci0 slot 9
alt0: DEGPA (1000BaseSX) Gigabit Ethernet Interface, hardware address: 00-60-6D-21-28-64
alt0: Driver Rev = V2.0.2 NUMA, Chip Rev = 6, Firmware Rev = 12.4.12
Created FRU table binary error log packet
kernel console: ace0
i2c: Server Management Hardware Present
dli: configured
NetRAIN configured.
alt0: 1000 Mbps full duplex Link Up via autonegotiation
panic (cpu 0): CNX MGR: Invalid configuration for cluster seq disk
drd: Clean Shutdown
DUMP: Will attempt to compress 93544448 bytes of dump
: into 959315952 bytes of memory.
DUMP: Dump to 0x200005: ....: End 0x200005
succeeded
halted CPU 1
CP - SAVE_TERM routine to be called
CP - SAVE_TERM exited with hlt_req = 1, r0 = 00000000.00000000
halted CPU 0
halt code = 5
HALT instruction executed
PC = fffffc00005e4ec0
(and, in answer to the question already asked before I send this - yes, the two root partitions do have the disklabels set to AdvFS for the first partitions)
Any help/suggestions gratefully received!
How can I predict what the booted system will call the new disks? If I knew that, I could mount them from my fresh Tru64 install and edit the sysconfigtab and cluster scripts to point to the new disks...
Why can't the ES40 read "osf_boot" on its boot disk? It is there (I've checked), but it isn't the first file in the directory listing (which it is on the other boot disk).
Am I barking up the wrong tree here? Should I just give up, and go through a proper re-install and re-creation of the cluster? I'd really rather not, as these systems are due for de-commissioning as soon as our Opteron replacement can be brought online! But that's another story....
Thanks in advance!
Andrew
--
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit,
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830 fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@xxxxxxxxxxxxxxxxxx