Brief on HSG80 SCSI-3 to SCSI-2 reconfiguration

From: Davis, Alan (Davis_at_tessco.com)
Date: 05/12/03

    Date: Mon, 12 May 2003 12:53:34 -0400
    To: "'tru64-unix-managers@ornl.gov'" <tru64-unix-managers@ornl.gov>
    
    

    This is, at this point, probably only good for historical documentation. I
    wrote it up after reconfiguring our SAN to allow both v4.0x and v5.1 servers
    to connect to the same fabric. Reconfiguring in the other direction, from
    SCSI-2 to SCSI-3, is much less problematic and is documented in the manuals.

    The saga of a SAN reconfiguration...

    Keeping in mind that we only have 8 UNIX systems to put up on the SAN
    together, some of the lessons learned may apply to others in a similar
    situation. I would hesitate to do this with a large server farm.

    The root cause of this reconfiguration is that the target date for
    upgrading our production Oracle Apps servers to v5.1 has come and gone,
    and I needed to be able to move forward with the SAN deployment with a
    mix of v4.0F and v5.1 systems. The SAN was originally configured with
    only v5.1 systems attached, and the v4.0F systems were to be upgraded
    before attaching them. The SAN now had to be reconfigured so that both
    OS versions could attach to it.

    The SAN configuration rules, among other things, require that the HSG80's
    be configured in SCSI-2 mode with transparent failover and only one HBA
    per host. This meant downgrading the HSG80 from SCSI-3 mode and multi-bus
    failover. Neither of these changes is covered in any of the manuals or
    whitepapers.

    I logged a call with Compaq Services, StorageWorks support and was told that
    I should delete all the connections and units prior to changing the HSG80
    settings. Further discussions with remote support and our local Field Circus
    engineer made it clear that upgrading to the latest revisions of all the
    bits and pieces would be very advisable.

    These upgrades consisted of:
        patching 6 v5.1 DS10L's from pk2 to pk3 to get the v1.29 emx driver
        patching 2 v4.0F systems, 1 DS20 and 1 AS4100, from pk5 to pk6 for
          the emx update
        upgrading the firmware on all KGPSA's to 3.18a4
        upgrading the FC SAN switches from 2.17 to 2.19g
        upgrading the HSG80's ACS from v8.5F-1 to v8.6F-1 (requires a card
          change)
        patching the HSG80's from v8.6F-1 to v8.6F-2

    Two of the DS10L's were production web servers and needed to be back in
    service as quickly as possible. The rest of the systems could be down for
    several hours on a Sunday without seriously affecting operations.

    The order of the updates may or may not be critical, but it seemed to be
    a good idea to get the OS and KGPSA's updated first, then the switch and
    finally the HSG80's. The switch to SCSI-2 and transparent failover would be
    last.

    The HSG80 8.6F-1 cards were ordered in advance. The KGPSA and FC switch
    firmware were downloaded from the Compaq support website. The KGPSA fw
    was put onto floppy, the FC fw was put on a UNIX system for upload.

    Full backups were made of all the systems prior to beginning the upgrades.

    The OS patches went on easily. The KGPSA firmware was more difficult.
    The readability of the floppy varied from system to system. Several
    DS10L's required numerous tries to be able to read it and one refused
    even after repeated efforts. For this system the fw was burned onto a CD
    and finally loaded. The other problems stemmed from the differences in
    getting into and out of the AlphaBIOS/ARC console on the various
    systems. Using the RCM (remote console) command "reset" was the most
    consistent way of exiting ARC.

    The FC fw file is loaded using rcp and requires that the copy not prompt
    for a password. There must be an entry in the hosts file for the FC switch
    and an entry in the $HOME/.rhosts file. I wasn't able to get it to work
    with the "admin" username in the .rhosts file, so only the hostname of
    the switch was used.
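
    As a rough illustration of the two entries described above, here is a
    minimal Python sketch that builds the text lines. The switch name
    "fcsw1" and the address 192.0.2.10 are made-up placeholders, and the
    helper names are hypothetical. The .rhosts format assumed here is the
    usual "hostname [username]"; the hostname-only form is the one that
    worked for me.

```python
from typing import Optional


def hosts_entry(ip_address: str, switch_host: str) -> str:
    """Build a hosts-file line mapping an address to the switch name."""
    return f"{ip_address}\t{switch_host}"


def rhosts_entry(switch_host: str, user: Optional[str] = None) -> str:
    """Build a .rhosts line; with no user, only the hostname is listed."""
    if user is None:
        return switch_host
    return f"{switch_host} {user}"


# Placeholder values, not taken from the real site configuration.
print(hosts_entry("192.0.2.10", "fcsw1"))
print(rhosts_entry("fcsw1"))            # hostname-only form
print(rhosts_entry("fcsw1", "admin"))   # form that did not work here
```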

    The web interface worked well to upload the new code. A switch reboot is
    required to activate the new fw. This will interrupt access to any disks
    or hosts served solely by that switch. This means that any systems that
    must stay online must have an alternate path to disks on the SAN or a
    non-SAN mirror of the disks.

    Similarly, the HSG80 update requires that at least one of the controllers
    be offline at a time. For this upgrade all systems on the SAN were either
    halted or had non-SAN mirrors. More on that later.

    The replacement procedure for the ACS cards is straightforward and presented
    no surprises.

    Applying the patch to bring the ACS up to -2 was beset with problems. The
    SWCC v2.2 seems to have problems with the controller software update
    process and was abandoned in favor of manually entering the patch via the
    CLCP utility on the controller.

    The process is tedious and error-prone, but is easily explained and has
    good error checking. The only problem came from the version verification.
    The patch listing printed out the version as V85F. When this was entered
    into the CLCP, however, the current version of the card was displayed as
    (V85F ). The CLCP wouldn't load the code due to a version mismatch. The
    difference is easy to see here, but during the loading process it isn't
    nearly as clear. The solution was to enter the version as
    "V85F<sp><sp>". This satisfied CLCP and the patch went on cleanly
    afterwards.
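
    The mismatch is the classic fixed-width field problem. The Python
    sketch below only illustrates the idea; it is not how CLCP is
    implemented. The 6-character field width is an assumption inferred from
    the "V85F<sp><sp>" entry that worked, and both function names are
    hypothetical.

```python
# Assumed field width: 4 characters of version plus the 2 trailing spaces
# that had to be typed. This is inferred, not documented.
VERSION_FIELD_WIDTH = 6


def pad_version(version: str, width: int = VERSION_FIELD_WIDTH) -> str:
    """Right-pad a version string with spaces to a fixed field width."""
    return version.ljust(width)


def versions_match(entered: str, stored: str) -> bool:
    """Strict, whitespace-sensitive comparison of two version strings."""
    return entered == stored


stored = "V85F  "  # version as held on the card, trailing spaces included

print(versions_match("V85F", stored))               # False: lengths differ
print(versions_match(pad_version("V85F"), stored))  # True: padded to width
```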

    The final steps were to, at last, reconfigure the HSG80 to achieve the
    ultimate goal, heterogeneous OS SAN access.

    The HSG80's were in SCSI-3 and multi-bus failover mode. It took a number
    of attempts and several controller reboots to find the right sequence of
    steps:
        Set all port 1 connections to use unit_offsets between 0 and 99
        Set all port 2 connections to use unit_offsets between 100 and 199
        Set this nofailover
        Manually restart other by pressing the restart button
        Set failover copy=this
        
    At this point any units above 99 are only visible on port 2 of the bottom
    controller and units 0-99 are only visible on port 1 of the top
    controller.
    This will affect which units are accessible to which systems if the switches
    aren't meshed. There are explanations of the different configuration options
    in the Heterogenous SAN Implementation whitepaper.
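
    The unit-number split described above can be written down as a small
    lookup. This Python sketch is just a planning aid under the ranges used
    here (0-99 on port 1, 100-199 on port 2); the function names are
    hypothetical and nothing in it talks to the controller.

```python
def port_for_unit(unit: int) -> int:
    """Return the host port (1 or 2) on which a unit number is visible."""
    if 0 <= unit <= 99:
        return 1
    if 100 <= unit <= 199:
        return 2
    raise ValueError(f"unit {unit} is outside the 0-199 range used here")


def offset_fits_port(unit_offset: int, port: int) -> bool:
    """Check that a connection's unit_offset lies in its port's range."""
    try:
        return port_for_unit(unit_offset) == port
    except ValueError:
        return False


print(port_for_unit(5))           # 1
print(port_for_unit(120))         # 2
print(offset_fits_port(100, 1))   # False: offset 100 belongs to port 2
```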

    If at all possible, shut down all systems attached to the HSG, even if no
    disks are being presented. The switch from SCSI-3 to SCSI-2 affects the
    initialization of the UNIX emx device driver. It doesn't seem to cause
    any data loss, but any AdvFS domains will panic if an I/O is attempted.
    LSM will not catch the errors until it is too late. Rebooting the system
    will bring all the disks back online, but it's less alarming to have them
    boot cleanly into SCSI-2 mode.

        Shut down all systems attached to the HSG80, if possible.
        Set this SCSI-2
        Restart other
        Restart this
        Reboot connected systems.

    The new disks should now be accessible, provided that the unit/LUN naming
    rules are followed.

    One nice surprise was that units that were deleted and re-added at a
    lower unit number retained their WWN and reattached to the host without
    changing their dsk number.

    There were some connections that had to be recreated. Deleting them from
    the HSG and issuing a "hwmgr -scan scsi" or "scsimgr -scan_bus bus=N"
    created new !NEWCONnn connections. The unit_offset was updated for these
    and new, more meaningful, names were given to them. The disks were then
    recognized by the host.
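
    With more than a handful of connections it helps to pick out the
    auto-generated names that still need an offset and a proper name. A
    minimal Python sketch, assuming the connection names have already been
    collected into a list by hand; the sample names other than the !NEWCON
    pattern are invented.

```python
import re

# Auto-generated names look like !NEWCON followed by digits.
AUTO_NAME = re.compile(r"^!NEWCON\d+$")


def auto_named(connections):
    """Return the connection names that still carry an auto-generated name."""
    return [name for name in connections if AUTO_NAME.match(name)]


# Invented sample list for illustration only.
names = ["WEB1_P1", "!NEWCON07", "ORA1_P2", "!NEWCON12"]
print(auto_named(names))  # ['!NEWCON07', '!NEWCON12']
```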

    The recommendation from Compaq Service to delete all units and connections
    seems to stem from this requirement to reconfigure them. Identifying which
    ones must change isn't difficult and limits the amount of effort to complete
    the reconfiguration.

    Regarding the non-SAN mirrors used to keep the production systems
    available while the SAN was offline: LSM was used to mirror the 18 GB of
    data to the H partition of the 30 GB internal disk on each DS10L. When
    the SAN was offline, LSM continued serving the data from the internal
    mirror. After the SCSI-3 to SCSI-2 configuration change each DS10L was
    rebooted and LSM automatically recovered the DISABLED STALE plexes from
    the internal mirror. The internal mirrors were then removed, leaving the
    system running strictly on the SAN RAID1 production disks.

