Re: physical drive replacement
Date: 25 Jun 03 06:53:15 PST
In article <1030624224635.2835D-100000@Ives.egh.com>,
John Santos <JOHN@egh.com> writes:
> On Tue, 24 Jun 2003, frank brown wrote:
>> I'm getting UNCORRECTABLE ECC errors on DUA12 which is actually a pair of
>> striped RZ29L (SCSI disks) shadowed with an identical 2-volume stripeset.
>> (We have 4 disks, configured as 2 shadowed stripesets.) The drives are
>> attached to an HSD10 DSSI-SCSI controller in a Storageworks BA350 tower
>> connected to a VMS 5.5-2 VAXcluster.
>> My first problem is determining which of the 2 physical drives in stripeset
>> DUA12 is throwing the errors. ANALYZE/ERROR identifies the unit as
>> _RAID1$DUA12. Since DUA12 is a stripeset, how can I tell which physical drive
>> it is?
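If the error log entry gives a logical block number (LBN), one way to
narrow it down is arithmetic: striping in general spreads consecutive
chunks of blocks round-robin across the members, so a failing LBN maps
to exactly one member. A rough Python sketch follows. The function name
and the 128-block chunk size are my assumptions for illustration, not
HSD10 facts; the real chunk size and member order have to come from the
controller configuration.

```python
def locate_block(lbn, members, chunk_blocks):
    """Map a stripeset LBN to (member index, LBN on that member).

    Assumes plain round-robin striping: chunk 0 on member 0, chunk 1 on
    member 1, and so on. chunk_blocks is a hypothetical value here.
    """
    if lbn < 0 or members < 1 or chunk_blocks < 1:
        raise ValueError("lbn must be >= 0, members and chunk_blocks >= 1")
    chunk = lbn // chunk_blocks          # which chunk of the stripeset
    member = chunk % members             # which physical drive holds it
    row = chunk // members               # how many full rows precede it
    member_lbn = row * chunk_blocks + lbn % chunk_blocks
    return member, member_lbn


if __name__ == "__main__":
    # Two-member stripeset, assumed 128-block chunks.
    print(locate_block(300, 2, 128))
```

With those assumed numbers, LBN 300 falls in chunk 2, which lands on
member 0. If the controller uses a different layout, the answer changes,
so treat this as a cross-check rather than proof.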
>> Once I identify the drive with the errors, I'd like to replace it with a
>> spare. Ideally I'd love to simply dismount the volume, pop the old drive
>> module out of the enclosure, replace the drive inside the module, slide it
>> back into the enclosure, initialize the drive, remount it as part of the
>> shadowset and have the system rebuild the volume. However I realize this is
>> 1. The HSD10 manual says I can warm-swap the drive but I need to 'quiesce
>> the SCSI bus'. Is there a way to perform this operation from the VMS
>> command line or do I need to shutdown VMS to get to >>> console to enter
>> HSD10 commands?
> Frank - since no one else has responded yet (though I only see 36 new
> messages today, so my ISP might be having news server problems), I'll
> throw in my 2 cents.
> I haven't used an HSD10, but on HSZ & HSJ controllers there are a set
> of buttons on the front, one for each bus. You hold down the button
> for a few seconds and it starts flashing. This means the controller
> has noticed the button press and is stalling all I/O to that bus. You
> then have about 30 seconds to make changes. (I'm not sure, but I think
> you can press the button again to resume activity when you are done,
> but the controller will time out after about 30 seconds and resume
> even if you do nothing.) While quiesced, the controller will continue
> to accept I/O requests for drives on the bus, but won't do anything
> about them. To the hosts, it just looks like the disks have gotten
> really slow, but nothing breaks. You probably want to do this when
> the system is relatively idle, just to keep your users from complaining.
>> 2. Will I need to recreate the stripeset at the HSD10 or will the existing
>> stripeset definition work with the replacement drive (since it's the same
>> model in the same slot)?
> Sorry, don't know how HSD's handle stripe sets...
>> 3. Any other thoughts or suggestions on dealing with this situation.
> You say "shadowed stripesets"... HBVS?
You'll want to first dissolve the shadowset with a DISMOUNT of the failing
member from VMS.
Then on the controller,
record the SHOW display for unit, stripeset and the failing disk,
DELETE the stripeset's unit,
DELETE the stripeset,
DELETE the disk,
quiesce the bus and perform the physical swap,
ADD the disk as seen in your prior SHOW display,
INIT the disk,
ADD the stripeset as seen in your prior SHOW display,
INIT the stripeset,
ADD the unit as seen in your prior SHOW display,
and SET any characteristics that are absent on the unit.
And back in VMS re-form your shadowset with the appropriate MOUNT command.
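The ordering is what matters in the steps above: tear down from the top
(unit, then stripeset, then disk) and rebuild from the bottom (disk, then
stripeset, then unit). Here is a small Python sketch that just prints
that checklist. It talks to no controller, the function name is made up,
and it deliberately does not reproduce exact HSD10 command syntax, only
the verbs already used above.

```python
# Bottom-up order: each layer is built on top of the previous one.
LAYERS = ["disk", "stripeset", "unit"]


def replacement_plan(layers=LAYERS):
    """Return the checklist: tear down top-down, rebuild bottom-up."""
    plan = ["DISMOUNT the failing shadowset member (VMS)"]
    # Record the configuration before deleting anything.
    plan += [f"SHOW {layer} and record the display" for layer in reversed(layers)]
    # Delete from the top layer down to the disk.
    plan += [f"DELETE {layer}" for layer in reversed(layers)]
    plan.append("quiesce the bus and swap the drive")
    # Re-add from the disk up, using the recorded SHOW displays.
    for layer in layers:
        plan.append(f"ADD {layer} as recorded")
        if layer != "unit":
            plan.append(f"INIT {layer}")
    plan.append("SET any unit characteristics that are missing")
    plan.append("MOUNT to re-form the shadowset (VMS)")
    return plan


if __name__ == "__main__":
    for number, step in enumerate(replacement_plan(), start=1):
        print(f"{number:2d}. {step}")
```

Printing it and ticking the steps off as you go is a cheap way to avoid
deleting or adding layers in the wrong order.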
> My guess, since half the blocks will need to be replaced, is that
> the write-logging stuff in later VMS (probably not available in V5.5
> anyway) wouldn't help in this and you'll have to do a full shadow
> copy, but I think you should be able to dismount the bad stripeset
> from the shadowset, if it hasn't already been kicked out (do this
> before pulling the bad drive), replace the broken drive, reconstitute
> the stripe set (Don't know how to do this), init the reconstituted
> DUA12: with the same volume label as your original stripe set,
> and mount it into the shadow set, which should trigger a shadow copy
> (not Merge!) to it. (Half the blocks, more or less, should still be
> identical to the source, but the other half will be blank or test
> patterns or old data, so you definitely want it to do a copy.)
> In an hour or so, the copy should complete and you should be all
> No down time for the application, provided you don't mind
> running without the shadow backup you normally have. If really
> paranoid, you can shut down the application, backup the good
> remaining stripeset (the good half of the shadow set), do the
> drive replace and shadow set rebuild, and then turn the application
> back on. This method will result in considerable downtime, probably
> about an hour plus whatever time it takes to backup the good disks and
> to swap out the bad disk and rebuild the stripeset, but you will have
> a good backup at all times.
> Other people recently have discussed the benefits of doing a
> physical backup of a shadow set to a new disk that you want to add
> to the shadow set, in order to reduce the amount of copying that
> the shadow copy needs to do. (I think shadow copying operates in
> a very cautious way. Something like read from the source disk,
> read-check again to verify it reads okay, read from the destination
> disk (looking for a potential bad block), read-check the destination
> disk, compare the source and destination, and if different, write
> to the destination disk, and then write-check the stuff just written.
> If a bad block or check failure is found anywhere in all this, then
> the bad block replacement process is initiated. I think if you
> backup/physical the source disk to the destination disk first, you
> save the copy from having to do the writes and write-checks, but
> of course the system still has to write all the data while doing
> the backup, and has to read the data an extra time, so I don't
> see how this saves you much, especially if you /verify the backup.)
> A Google search should find the thread that discussed this, a few
> months ago.
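To make the compare-then-write idea concrete, here is a toy Python model
of it. This is only a sketch of the behaviour John describes, not how VMS
shadowing is actually implemented: the "disks" are byte buffers, the
function name and 512-byte block size are my assumptions, and the
read-checks and bad block replacement are left out.

```python
def compare_then_write(source, target, block_size=512):
    """Make target identical to source, writing only blocks that differ.

    source is bytes, target is a bytearray of the same length.
    Returns the number of blocks that had to be written.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be the same size")
    written = 0
    for start in range(0, len(source), block_size):
        end = min(start + block_size, len(source))
        src_block = source[start:end]
        if target[start:end] != src_block:      # compare source and destination
            target[start:end] = src_block       # write only when they differ
            if target[start:end] != src_block:  # stand-in for the write-check
                raise IOError(f"write-check failed at offset {start}")
            written += 1
    return written


if __name__ == "__main__":
    good = bytes(range(256)) * 4        # 1024 bytes = 2 blocks of "good" data
    rebuilt = bytearray(good)
    rebuilt[512:] = bytes(512)          # second half blank, as after a drive swap
    print(compare_then_write(good, rebuilt))   # blocks written on first pass
    print(compare_then_write(good, rebuilt))   # blocks written on second pass
```

In this toy, a target that already matches the source costs reads and
compares but no writes, which is the saving the BACKUP/PHYSICAL trick is
after. It also shows John's objection: something still had to write those
blocks beforehand.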
>> -Frank Brown
>> Seattle Fire Dept.
> Fire Dept? Maybe you want to be paranoid :-)
> John Santos
> Evans Griffiths & Hart, Inc.
> 781-861-0670 ext 539
-- - Jim