HELP - stale partitions



Hi,

[after years without posting, this is my second post on different issues within an hour :-)]

I have a serious problem that I'm having difficulty resolving, on a critical machine (our NFS server).

Environment:

IBM pSeries 7028-6C1 (p610), AIX 5.2 ML05. rootvg is mirrored (hdisk0 & hdisk1, 18 GB each). This setup has not changed in the last 5 years. The latest ML was installed around 5/2005. There have never been any apparent disk/SCSI/storage problems with this machine.

Last week (Dec. 6th) "stale partition" messages began to appear for all rootvg LVs, along with error messages about hdisk0. After consulting with my dealer's technical support, we tried to unmirror and then re-mirror rootvg. Mirroring failed, as can be seen below:

Before command completion, additional instructions may appear below.

0516-1296 lresynclv: Unable to completely resynchronize volume.
The logical volume has bad-block relocation policy turned off.
This may have caused the command to fail.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd6.
0516-932 /etc/syncvg: Unable to synchronize volume group rootvg.
0516-1124 mirrorvg: Quorum requirement turned off, reboot system for this
to take effect for rootvg.
0516-1126 mirrorvg: rootvg successfully mirrored, user should perform
bosboot of system to initialize boot records. Then, user must modify
bootlist to include: hdisk1 hdisk0.



Then we broke the mirror and replaced hdisk0, hoping to re-mirror after adding the new hdisk0 to rootvg. The system wouldn't even add hdisk0 to rootvg.

Next we did an alt_disk_install of hdisk1 onto hdisk0, rebooted from hdisk0 and replaced hdisk1 (so now both disks are new). Note that since the alt_disk_install filesets were not installed, we installed them, but unfortunately we forgot to apply the fixes for them and used just the base level from the AIX 5.2 CDs, which left these filesets at level 10. We then managed to mirror rootvg, but after about 30 hours the machine rebooted, and I now have a zillion messages of the form:

EAA3D429 1215095906 U S LVDD PHYSICAL PARTITION MARKED STALE

Immediately after the restart, the following message appeared:
9D035E4D 1214165306 P S SYSVMM DATA STORAGE INTERRUPT, PROCESSOR

The detailed message for that event was:

LABEL: DSI_PROC
IDENTIFIER: 9D035E4D

Date/Time: Thu Dec 14 16:53:23 IST
Sequence Number: 3019
Machine Id: 0056BC9A4C00
Node Id: aeserv
Class: S
Type: PERM
Resource Name: SYSVMM

Description
DATA STORAGE INTERRUPT, PROCESSOR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 0000 0000
SEGMENT REGISTER, SEGREG
4000 0000 0000 0000
DATA STORAGE INTERRUPT ADDRESS REGISTER
2000 71A6 F000 0000
EXVAL
2FF3 8DD0 0000 0000




After that, messages of this type began:
0BA49C99 1214165606 T H scsi0 SCSI BUS ERROR
followed by the PHYSICAL PARTITION MARKED STALE and operator notification messages.
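For anyone sifting through the errpt summary lines above (EAA3D429, 9D035E4D, 0BA49C99), here is a minimal Python sketch that splits them into fields and tallies which errors dominate. It assumes the usual errpt summary layout (IDENTIFIER, TIMESTAMP as mmddHHMMyy, T = type, C = class, RESOURCE_NAME, DESCRIPTION); the function and class names are illustrative only, not part of any AIX tool.

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple


class ErrptEntry(NamedTuple):
    identifier: str      # error template ID, e.g. EAA3D429
    timestamp: datetime  # decoded from the mmddHHMMyy field
    err_type: str        # first letter of the type: P=PERM, T=TEMP, U=UNKN, ...
    err_class: str       # H=hardware, S=software, O=operator, U=undetermined
    resource: str        # resource name, e.g. LVDD, SYSVMM, scsi0
    description: str


def parse_errpt_summary(line: str) -> ErrptEntry:
    """Split one errpt summary line into its six fields (assumed layout)."""
    # maxsplit=5 keeps the free-text description (which contains spaces) intact
    ident, stamp, etype, eclass, resource, desc = line.split(None, 5)
    # Timestamp assumed to be mmddHHMMyy; %y maps 00-68 to 20xx
    when = datetime.strptime(stamp, "%m%d%H%M%y")
    return ErrptEntry(ident, when, etype, eclass, resource, desc.strip())


def count_by_identifier(lines):
    """Tally entries per (identifier, resource), skipping blanks and the header."""
    return Counter(
        (e.identifier, e.resource)
        for e in (
            parse_errpt_summary(line)
            for line in lines
            if line.strip() and not line.startswith("IDENTIFIER")
        )
    )
```

Decoded this way, the DSI_PROC line (1214165306) comes out as Dec 14 16:53, which agrees with the Date/Time in the detailed report, and the SCSI BUS ERROR on scsi0 follows about three minutes later.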


At the moment, lspv does NOT show hdisk0, and lsvg rootvg shows 1 active and 1 stale partition.
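lsvg rootvg only reports volume-group totals; `lsvg -l rootvg` lists an LV STATE per logical volume (e.g. open/syncd vs. open/stale). A minimal Python sketch, assuming the usual seven-column layout (LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT), that picks out the stale LVs from that text. `stale_lvs` is a hypothetical helper name.

```python
from typing import List


def stale_lvs(lsvg_l_output: str) -> List[str]:
    """Return names of LVs whose LV STATE column ends in '/stale'."""
    stale = []
    for line in lsvg_l_output.splitlines():
        fields = line.split()
        # Data rows are assumed to have 7 columns; the "rootvg:" banner has one
        # field and the column header has "TYPE" (not a number) in the LPs slot.
        if len(fields) < 7 or not fields[2].isdigit():
            continue
        if fields[5].endswith("/stale"):
            stale.append(fields[0])
    return stale
```

The input would be the captured text of `lsvg -l rootvg`; splitting on whitespace is enough here because LV names and states contain no spaces.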


If required I'll send the detailed logs, but I didn't think it was right to send them to everyone on the list...


I'd appreciate any help, in particular regarding what is wrong: is it s/w or h/w? If h/w, then what? SCSI? SCSI backplane?

Regards,
/Zvika


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Zvika Bar-Deroma

Systems and Network manager Phone: (+972)-4-829-2706 ; Fax : (+972)-4-829-2315
Faculty of Aerospace Engineering, Technion, Haifa 32000, Israel

e-mail : zvika@xxxxxxxxxxxxxxxxxxxxx
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


