Re: Anyone else having problems with VMS and HSG80-based arrays?

From: Rob Young (young_r_at_encompasserve.org)
Date: 09/11/03


Date: 11 Sep 2003 16:11:58 -0500

In article <5a85bce2.0309111054.6f6fdbac@posting.google.com>, svieth@wi.rr.com (Scott Vieth) writes:
> Hi:
>
> I've got an ES40 running VMS 7.3-1 which is connected to an ESA12000
> (running ACS 8.7S-2).
>
> On July 2nd, we saw a few of the DGAnn: devices go into
> mntverifytimeout. I was forced to reboot the system in an attempt to
> get things running again. The OS came up okay, but when we attempted
> to start our IDX software (running on Cache), we found that the Cache
> database files were badly corrupted. After spending a few hours on
> the phone with a support person from IDX and a support person from
> InterSystems, the decision was made to restore from tape. That process
> took pretty much all day (restoring from the last full backup and then
> re-playing the journal files).
>
> We lost an entire day's worth of production on July 2nd. Our clinics
> that use the IDX system for scheduling and other "front desk"
> patient-related activities were dead-in-the-water. We also had
> hundreds and hundreds of "back office" billing people who could do
> nothing that day because our IDX system was unavailable.
>
> I have an open IPMT case with "Storage Engineering". I have sent them
> tons and tons of logs and console output files.
>
> We haven't received any encouraging news, patches or tips on how to
> keep the ESA12000 from going "incommunicado".
>
> Last night, we got hit with the same problem that whacked us on July
> 2nd. Two DGAnn: devices went to "mntverifytimeout". Had to reboot
> the ES40. We got lucky and did not have to restore the IDX
> environment from tape. The odd thing that I noticed is that when I
> tried a "restart this" on one of the HSG80s, the CLI hung on both
> controllers. I had to have an operator hit the buttons on the OCP on
> the front of the HSG80s to restart them.
>
> AND THIS MORNING, we got hit again. Same symptoms. Two DGAnn:
> devices went mntverifytimeout. Had to reboot without shutting down
> Cache. Got lucky once more and the Cache data files were not corrupt
> after reboot. Tried a "restart other" on the HSG80s. The CLI on both
> controllers hung. Had to walk an operator through hitting the buttons
> on the front of the HSG80s to restart them. Does anyone see a pattern
> here?
>

        Yes I see a pattern.

        Why July 2nd? Did you upgrade firmware July 1st? What changed?

        Tips?

        I Googled up a DCL snippet.  To catch a drive the moment it
        goes into MntVerify, stick something like this in a loop:

http://groups.google.com/groups?selm=8NOV00.19075364%40feda34.fed.ornl.gov&oe=UTF-8&output=gplain

  $ mntvfy = %x4000
  $ valid = %x0800
  $ loop:
  $ sts = f$getdvi(vol,"sts")
  $ if (sts .and. mntvfy) .eq. mntvfy then ... ! Disk is in MntVerify

                Sound an alarm, and reboot the hung controller.

  $ if (sts .and. valid) .ne. valid then ... ! Disk is in MntVerifyTimeout

                Run in circles, scream and shout

  $ wait 00:01:00
  $ goto loop
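
        A fuller version of that loop might look like the following.
        This is only a sketch: the P1 parameter, the message text and
        the one-minute poll are my own assumptions, not something taken
        from the referenced post, and the two bit masks are simply the
        ones quoted above.  Swap the WRITE lines for whatever alarm you
        actually use.

```dcl
$! Hypothetical sketch: poll one disk and report its mount-verify state.
$! P1 = device name supplied by the caller (for example one of the DGAnn: units).
$ vol = p1
$ if vol .eqs. "" then exit
$ mntvfy = %x4000          ! mount verification in progress (mask from the post)
$ valid  = %x0800          ! volume valid (mask from the post)
$ loop:
$   sts = f$getdvi(vol,"STS")
$   if (sts .and. mntvfy) .eq. mntvfy
$   then
$     write sys$output f$time(), " ", vol, " is in MntVerify"
$   endif
$   if (sts .and. valid) .ne. valid
$   then
$     write sys$output f$time(), " ", vol, " is not valid (MntVerifyTimeout?)"
$   endif
$   wait 00:01:00
$   goto loop
```

        Note that the valid bit is also clear on a disk that is simply
        not mounted, so point this only at volumes you expect to be
        mounted.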

---
	Second, raise your MVTIMEOUT to give yourself more time.
	Otherwise you time out - HANGING THE DRIVE(s) - and are then
	forced to reboot the node.  (Here is what I have; you may want to
	bounce this and any other advice off HP, by the way):

$ mcr sysgen show mv
Parameter Name           Current    Default     Min.      Max.     Unit  Dynamic
--------------           -------    -------    -------   -------   ----  -------
MVTIMEOUT                   36000       3600         1      64000 Seconds    D

	3600 seconds is far too short in my opinion.  You may also want
	to adjust the shadowing timeouts if shadowing is in use:

$ mcr sysgen show shadow_mbr
Parameter Name           Current    Default     Min.      Max.     Unit  Dynamic
--------------           -------    -------    -------   -------   ----  -------
SHADOW_MBR_TMO              18000        120         1      65535 Seconds    D

	This way, shadow sets will also hang, rather than time out, until
	you intervene.  But why the corruption?  That's an interesting
	question.
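
	Both parameters are flagged D (dynamic) in the output above, so
	they can be changed on the running system.  A minimal sketch,
	using the values I show above (pick your own, and check them with
	HP first); SHADOW_MBR_TMO only matters if you run shadowing:

```dcl
$! Sketch only: the values are the ones from the SYSGEN output in this post.
$ mcr sysgen
SYSGEN> USE ACTIVE
SYSGEN> SET MVTIMEOUT 36000
SYSGEN> SET SHADOW_MBR_TMO 18000
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
```

	To make the change stick across reboots and AUTOGEN runs, the
	same settings would normally also go into MODPARAMS.DAT.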
> Is anyone else experiencing problems like this with HSG80s connected
> to their VMS systems?  I heard that another big shop here in Milwaukee
> was having similar problems....
> 
> This is just killing us.  The IDX system is one of our most important
> systems.  I can't have the storage going "bye-bye" in the middle of
> the night.
> 
> The only hunch we have about this bug in the ESA12000 is that it seems
> to be related to periods of high I/O activity.  On all three
> occasions, we were hit during the time that our backups run.  We are
> also using controller-based snapshots...
> 
	Oh - skip the shadowing advice above.  But at your version of VMS,
	*if* you had shadowing and only ONE of the two shadow members was
	in MntVerify (maybe one member is on one controller pair and the
	other is on a controller pair that isn't flaking out at the time),
	you could kick out the naughty member.  (Maybe you want to bring
	down both controller pairs, or are forced to, and therefore need
	to kick out the members on the flaky controller(s).)

	$ dismount/force_removal  badboy:

	That is one advantage of having a long MVTIMEOUT and SHADOW_MBR_TMO
	and using shadowing.
> Is anyone suffering this problem?  
	Not here.  Different kit.  
        Seems like the heavy IO from concurrent operations, controller copies, 
	backups and night jobs is stretching the HSG80s to a breaking point.
				Rob