Re: Errors during shadow set merge



On Feb 21, 8:27 pm, John Santos <j...@xxxxxxx> wrote:
Richard B. Gilbert wrote:
tadamsmar wrote:

On Feb 21, 7:54 am, tadamsmar <tadams...@xxxxxxxxx> wrote:

On Feb 20, 10:19 pm, Michael Austin <maus...@xxxxxxxxxxxxxxxxxx>
wrote:

tadamsmar wrote:

On Feb 18, 11:17 pm, Michael Austin <maus...@xxxxxxxxxxxxxxxxxx>
wrote:

tadamsmar wrote:

On Feb 18, 5:00 pm, "Richard B. Gilbert" <rgilber...@xxxxxxxxxxx>
wrote:

tadamsmar wrote:

I noticed I was getting errors when adding a member to a shadow set.
I have been getting errors during shadow set merges since I bought
this refurb DS10.
Got 109 errors today when I remerged after doing an image: 16 errors
on DKA0 and 93 on DKA100.
What do you think is causing this?
Are these soft errors?
Here is the log for one:
**** V3.4  ********************* ENTRY 1667 ********************************
Logging OS                        1. OpenVMS
System Architecture               2. Alpha
OS version                           V7.3-2
Event sequence number         11474.
Timestamp of occurrence              18-FEB-2008 09:52:48
Time since reboot                    77 Day(s) 1:23:46
Host name                            EESD
System Model                         AlphaServer DS10 617 MHz
Entry Type                        1. Device Error
---- Device Profile ----
Unit                                 $1$DKA0
Product Name                         ATLAS10K2-TY184L
Vendor                               QUANTUM
-- Driver Supplied Info -
Device Firmware Revision             DA40
VMSSCSIError Type               5. Extended Sense Data from Device
SCSIID                         x00
SCSILUN                        x00
SCSISUBLUN                     x00
Port Status               x00000001  NORMAL  -  normal successful completion
SCSICommand Opcode             x28  Read (10 byte command)
Command Data
                               x00
                               x02
                               x06
                               x44
                               x8A
                               x00
                               x00
                               x01
                               x00
SCSIStatus                     x02  Check Condition
Remaining Byte Length            18.
--- Device Sense Data ---
Error Code                      xF0  Current Error
                                    Information Bytes are Valid
Segment #                       x00
Information Byte 3              x02
           Byte 2              x06
           Byte 1              x44
           Byte 0              x8A  LBA:  x0206448A
Sense Key                       x03  Medium Error
Additional Sense Length         x0A
CMD Specific Info Byte 3        x21
                 Byte 2        x23
                 Byte 1        x3E
                 Byte 0        xD4
ASC & ASCQ                    x1100  ASC  =   x0011
                                    ASCQ =   x0000
                                    Unrecovered Read Error
FRU Code                        x00
Sense Key Specific Byte 0       x80  Valid Sense Key Data
                  Byte 1       x00
                  Byte 2       xA0
----- Software Info -----
UCB$x_ERTCNT                     16. Retries Remaining
UCB$x_ERTMAX                     16. Retries Allowable
IRP$Q_IOSB                x0000000000000000
UCB$x_STS                 x08021810  Online
                                    Software Valid
                                    Unload At Dismount
                                    Volume is Valid on the local node
                                    Unit supports the Extended Function bit
IRP$L_PID                 x82640450  Requestor "PID"
IRP$x_BOFF                     4416. Byte Page Offset
IRP$x_BCNT                      512. Transfer Size In Byte(s)
UCB$x_ERRCNT                     32. Errors This Unit
UCB$L_OPCNT                22716780. QIO's This Unit
ORB$L_OWNER               x00010004  Owners UIC
UCB$L_DEVCHAR1            x1C4D4008  Directory Structured
                                    File Oriented
                                    Sharable
                                    Available
                                    Mounted
                                    Error Logging
                                    Capable of Input
                                    Capable of Output
                                    Random Access
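
As a cross-check on the entry above, here is a minimal Python sketch that decodes the READ(10) command bytes and the 18 bytes of fixed-format sense data, assuming the standard SCSI layouts for both. The byte values are copied from the log; the function names and the partial sense-key table are illustrative, not part of any OpenVMS tool.

```python
# Bytes copied from error log ENTRY 1667: opcode x28 followed by the
# nine "Command Data" bytes, then the 18 bytes of device sense data.
CDB = bytes([0x28, 0x00, 0x02, 0x06, 0x44, 0x8A, 0x00, 0x00, 0x01, 0x00])
SENSE = bytes([0xF0, 0x00, 0x03, 0x02, 0x06, 0x44, 0x8A, 0x0A,
               0x21, 0x23, 0x3E, 0xD4, 0x11, 0x00, 0x00, 0x80, 0x00, 0xA0])

# Partial table of SCSI sense keys (only the common ones).
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0xB: "Aborted Command",
}


def decode_read10(cdb):
    """Return (lba, block_count) from a 10-byte READ(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


def decode_fixed_sense(sense):
    """Decode the main fields of fixed-format sense data."""
    if len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    return {
        "information_valid": bool(sense[0] & 0x80),
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "other"),
        "information": int.from_bytes(sense[3:7], "big"),  # failing LBA if valid
        "asc": sense[12],
        "ascq": sense[13],
    }


if __name__ == "__main__":
    lba, blocks = decode_read10(CDB)
    fields = decode_fixed_sense(SENSE)
    print(f"READ(10) of {blocks} block(s) at LBA x{lba:08X}")
    print(f"sense key x{fields['sense_key']:02X} ({fields['sense_key_name']}), "
          f"ASC/ASCQ x{fields['asc']:02X}/x{fields['ascq']:02X}, "
          f"information x{fields['information']:08X}")
```

Decoded this way, the LBA in the command and the LBA in the sense information bytes are the same block, and the one-block transfer length agrees with the 512-byte transfer size in the software info.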

Is that system under service contract?  If so, ask to have the drive
replaced!
I hope you have a recent backup that's readable.  If you don't, try to
make one!  Right now!!!!
It could be just a single bad block.  It could also be all the warning
you are going to get that the disk is failing!  Once you hear that
"loud scraping sound" it's all over!!
If you don't have a service contract, order a replacement disk and get
a rush on the delivery!
Meanwhile, keep an eye on the disk.  If you get more error messages
with different LBAs it means the situation is deteriorating and you may
have an emergency within a few minutes or hours.

Are these hard or soft errors?

These are generally HARD errors - do what he said and order a disk
ASAP.

I am skeptical that it's the disks.  (In my original message, I indicated
that I get errors on both disks.)

I have had this problem for a while.  I have run:

ANAL/MEDIA/EXER

on the disks and found no errors.

These error bursts only happen when I do a shadow set merge.

I suspect something about the SCSI, or connections, that is stressed
by a merge.

I still suspect the media - and I can back it up with 24 years of
reading error logs... can you?

No.

Here is a log of my recent findings:

Merged the shadow set, getting 16 errors on DKA0 and 83 errors on
DKA100.

Did an ANALYZE/MEDIA/EXER=FULL of DKA100 and found 1 bad block.  Got a
good many errors logged during the ANALYZE.

Merged the shadow set, getting 16 errors on DKA0 and 5 errors on
DKA100.

Did an ANALYZE/MEDIA/EXER=FULL of DKA0 and found 0 bad blocks.  Got
0 errors logged during the ANALYZE.

Merged the shadow set, getting 4 errors on DKA0 and 19 errors on
DKA100.

I will swap out one of the disks and give it a try.  I will put in a
disk that is logging no errors at its current location.

I swapped the dka100 disks between two DS10s (same model disks).

Then I merged the shadowset on the problem machine.   I got 4 errors
on dka100 and 16 on dka0, all reported as unrecoverable.

On the other machine I got about 34 errors on dka100 (most indicating
unrecoverable) during the shadowset merge.  But I realized that I had
found 1 bad block on it when it was on the problem machine using
ANALYZE.  So, I ran ANAL/MEDIA/EXER=FULL and found 13 bad blocks.

I suspect there must be more than bad disks on the problem machine,
since it got 4 unrecoverable errors (at 2 LBAs) on a disk that had none
recently during shadowset merges on the other machine.

BTW,  when you do a DIAGNOSE/TRANS/SUMMARY all these errors
are listed as SCSI errors, but when you look at the sense data in
detailed report, most are identified as medium errors.

I guess I will ask the vendor for a couple of disks under the warranty,
but I have no confidence that it will solve the problem.  Maybe I need
the machine replaced.

It is possible that you have a problem with a cable, or a host bus
adapter either as a contributing factor or (less likely) as the whole
problem.

Or it could be a problem with SCSI bus termination, either no terminator
or double termination or an extra terminator in the middle or a bad
terminator.  We had an old BA350(?) shelf (fast/narrow, gray disks)
that we daisychained an external 8mm tape drive to, worked fine for a
year or two, and then someone inserted a tape upside down or backwards
and busted it.  DEC (or maybe Compaq) replaced the drive about 5 times,
they kept failing.  Would work fine in a short test but full volume
backups would usually fail.  After they replaced the drive about 4 or
5 times, someone said the magic word "termination", and so I took the
covers off the BA350 and discovered whoever had added the external
tape drive had neglected to remove the internal terminator.  After
removing it, the latest "DOA" tape drive worked fine.  For some reason
the original drive had no problem with the double termination, but
all the replacements did.

BTW, I thought analyze/media/exer did nothing with disks newer than
about SDI (RAxx) vintage.  Have you tried ANA/DISK/READ_CHECK, or
(if V7.3-2 or later) ANA/DISK/SHADOW?  If the shadow copies are
finding and replacing a bunch of existing bad blocks, once the
shadow copy completes, the bad blocks should be safely sequestered,
and /read_check or /shadow should come up clean.

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539

Thanks,

I tried ANA/DISK/SHADOW.

On both systems, the command eventually terminated with a
SYSTEM-F-PARITY. There was no information in the command output file.
But DIAG/TRAN showed one error on both DKA0 and DKA100 on *the same
logical block at the same time* (there are only two disks in the
shadow set) on the problem system. DIAG/TRAN showed one error on
DKA100 on the other system.
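
One way to check how often an error lands on the same logical block on both members is to tally the logged errors by LBA. Below is a minimal Python sketch of that idea. It assumes the (unit, LBA) pairs have already been pulled out of the DIAGNOSE report by hand or by some other script; the function name is hypothetical, and the second LBA in the sample is a made-up value for illustration only.

```python
from collections import defaultdict


def lbas_seen_on_multiple_units(entries):
    """Map each LBA that was logged on more than one unit to those units.

    entries is an iterable of (unit_name, lba) pairs, one per error log entry.
    """
    units_by_lba = defaultdict(set)
    for unit, lba in entries:
        units_by_lba[lba].add(unit)
    return {lba: sorted(units)
            for lba, units in units_by_lba.items()
            if len(units) > 1}


if __name__ == "__main__":
    # Illustrative input only: the first LBA is the one from the posted log
    # entry, the second is invented for the example.
    sample = [
        ("DKA0", 0x0206448A),
        ("DKA100", 0x0206448A),
        ("DKA100", 0x02100000),
    ]
    for lba, units in lbas_seen_on_multiple_units(sample).items():
        print(f"LBA x{lba:08X} logged on: {', '.join(units)}")
```

An LBA that shows up on both members points at that block rather than at one drive, while LBAs that appear on a single member only look more like per-disk trouble. That is only a rough way to sort the entries, not a diagnosis.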

However, before I swapped the disks, the dka100 on the non-problem
(other) DS10 was not logging any unrecoverable errors. And I never
get errors when I do a tape backup with verify. And before I
swapped the disks I rarely got errors except when merging the shadow
set on the problem system. The symptom that started my analysis was
the fact that errors were always logged on the problem system when I
did a shadowset merge.

These errors always arise after the 90% point in the shadowset merge
and in ANA/DISK/SHADOW. I never see errors earlier in the run.
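
To test the "after 90%" observation against the logged LBAs, each failing LBA can be expressed as a percentage of the volume size. The Python sketch below does that, assuming the merge walks the volume roughly in LBA order (which the thread does not confirm). The function names are hypothetical, and total_blocks has to be supplied by the user.

```python
def position_percent(lba, total_blocks):
    """Return how far into the volume an LBA lies, as a percentage."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    if not 0 <= lba < total_blocks:
        raise ValueError("lba is outside the volume")
    return 100.0 * lba / total_blocks


def errors_past(lbas, total_blocks, threshold_percent=90.0):
    """Return the LBAs that lie at or beyond threshold_percent of the volume."""
    return [lba for lba in lbas
            if position_percent(lba, total_blocks) >= threshold_percent]


if __name__ == "__main__":
    # Toy numbers only; substitute the real block count of the shadow set
    # members and the LBAs from the error log.
    total = 1000
    logged = [100, 900, 950]
    print(errors_past(logged, total))
```

The volume size should be available from the "Total blocks" figure in SHOW DEVICE/FULL output. If every failing LBA falls in the last tenth of the disk, the errors are tied to a region of the media; if the LBAs are scattered while the timing is always late in the merge, that would point somewhere else.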