Solaris9 disk mirroring "no majority consensus" panic problem

From: rader (rader_at_hep.wisc.edu)
Date: 03/23/05


Date: Wed, 23 Mar 2005 14:29:10 -0600

I've got a system with two identical disks, identical partitioning,
no partition overlaps, and six metadbs (three on each disk).

When I pull power to a drive (thus killing exactly 50% of the
metadbs), I get an unexpected kernel panic:

  panic[cpu0]/thread=3000065b7c0: md: Panic due to lack of DiskSuite state
   database replicas. Fewer than 50% of the total were available,
   so panic to ensure data integrity.
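
To make the "exactly 50%" figure concrete, here is a minimal Python sketch that tallies replicas per disk and shows what survives if either disk is lost. The listing is copied from the metadb output further down; the helper name replicas_per_disk is made up for illustration and is not part of any Solaris tool.

```python
import re
from collections import Counter

# Replica listing copied from the `metadb` output in this post.
METADB_LISTING = """\
      a m p luo 16 8192 /dev/dsk/c0t0d0s3
      a p luo 16 8192 /dev/dsk/c0t0d0s4
      a p luo 16 8192 /dev/dsk/c0t0d0s5
      a p luo 16 8192 /dev/dsk/c0t1d0s3
      a p luo 16 8192 /dev/dsk/c0t1d0s4
      a p luo 16 8192 /dev/dsk/c0t1d0s5
"""


def replicas_per_disk(listing):
    """Count state database replicas per disk (hypothetical helper)."""
    counts = Counter()
    for line in listing.splitlines():
        # Take the disk name (cXtYdZ) from the slice path ending the line.
        match = re.search(r"/dev/dsk/(c\d+t\d+d\d+)s\d+\s*$", line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = replicas_per_disk(METADB_LISTING)
total = sum(counts.values())
for disk in sorted(counts):
    left = total - counts[disk]
    print(f"lose {disk}: {left} of {total} replicas left ({100 * left // total}%)")
# prints:
# lose c0t0d0: 3 of 6 replicas left (50%)
# lose c0t1d0: 3 of 6 replicas left (50%)
```

Either way, pulling one drive leaves exactly half of the replicas, not fewer than half.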

More gory details follow my sig.

The Solaris 9 Volume Manager Administration Guide clearly states
that the system should only panic if *fewer* than half of the state
database replicas are available. [1]
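
A minimal sketch of that rule in Python, assuming the guide's wording as paraphrased here: keep running with at least half of the replicas, panic with fewer than half, and require a majority (half + 1) to reboot into multiuser mode. The function name and return labels are invented for illustration; this is not how the md driver itself is written.

```python
def replica_quorum(total, available):
    """Classify a replica count under the documented half / half+1 rule."""
    if total <= 0 or not 0 <= available <= total:
        raise ValueError("need total > 0 and 0 <= available <= total")
    if 2 * available < total:
        return "panic"      # fewer than half of the replicas are available
    if 2 * available > total:
        return "run+boot"   # strict majority: keeps running and can reboot
    return "run-only"       # exactly half: should keep running, cannot reboot unattended


print(replica_quorum(6, 3))  # prints: run-only
print(replica_quorum(6, 2))  # prints: panic
```

By that reading, six replicas with three survivors should land in the "run-only" case, which is why the panic is unexpected.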

I've used and thoroughly tested this configuration on Solaris 7 on
a number of systems without any problems.

What the heck am I doing wrong??

(Yes... I have the latest patch cluster installed.)

steve
- - -
systems & network manager
high energy physics
university of wisconsin

  [1] http://docs.sun.com/app/docs/doc/816-4518/6mannlddp?q=volume+manager&a=view

------------------------------------------------------------------------------

salt(root): uname -a
SunOS salt 5.9 Generic_118558-04 sun4u sparc SUNW,Ultra-5_10

salt(root): /usr/ucb/df | egrep '^Filesystem|^/dev'
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d10 35375205 2820520 32200933 9% /

salt(root): swap -l
swapfile dev swaplo blocks free
/dev/md/dsk/d20 85,20 16 4198304 4198304

salt(root): echo -e "0\np\np\nq" | format | egrep '^ 0|^ [0-9]'
        0. c0t0d0 <ST340014A cyl 19156 alt 2 hd 16 sec 255>
   0 root wm 0 - 17611 34.26GB (17612/0/0) 71856960
   1 swap wu 17612 - 18640 2.00GB (1029/0/0) 4198320
   2 backup wm 0 - 19155 37.27GB (19156/0/0) 78156480
   3 usr wm 18641 - 18646 11.95MB (6/0/0) 24480
   4 usr wm 18647 - 18652 11.95MB (6/0/0) 24480
   5 usr wm 18653 - 18658 11.95MB (6/0/0) 24480
   6 unassigned wm 0 0 (0/0/0) 0
   7 unassigned wm 0 0 (0/0/0) 0

salt(root): echo -e "1\np\np\nq" | format | egrep '^ 1|^ [0-9]'
        1. c0t1d0 <ST340014A cyl 19156 alt 2 hd 16 sec 255>
   0 root wm 0 - 17611 34.26GB (17612/0/0) 71856960
   1 swap wm 17612 - 18640 2.00GB (1029/0/0) 4198320
   2 backup wu 0 - 19155 37.27GB (19156/0/0) 78156480
   3 usr wm 18641 - 18646 11.95MB (6/0/0) 24480
   4 usr wm 18647 - 18652 11.95MB (6/0/0) 24480
   5 usr wm 18653 - 18658 11.95MB (6/0/0) 24480
   6 unassigned wm 0 0 (0/0/0) 0
   7 unassigned wm 0 0 (0/0/0) 0

------------------------------------------------------------------------------

Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 400MHz), No Keyboard
OpenBoot 3.25, 128 MB (50 ns) memory installed, Serial #15771670.
Ethernet address 8:0:20:f0:a8:16, Host ID: 80f0a816.

Initializing Memory
[...]
Rebooting with command: boot
Boot device: disk:a File and args:
SunOS Release 5.9 Version Generic_118558-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
[...]
WARNING: forceload of misc/md_trans failed
WARNING: forceload of misc/md_raid failed
WARNING: forceload of misc/md_hotspares failed
WARNING: forceload of misc/md_sp failed
configuring IPv4 interfaces: hme0.
Hostname: salt
The system is coming up. Please wait.
Setting netmask of hme0 to 255.255.254.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway salt
syslog service starting.
starting NetWorker daemons:
  nsrexecd

Starting cfexecd
Starting cfservd
Starting NET-SNMP snmpd
volume management starting.
The system is ready.

salt console login:

Mar 23 14:07:27 salt uata: WARNING: timeout: reset target chno = 0 targ = 1

panic[cpu0]/thread=3000065b7c0: md: Panic due to lack of DiskSuite state
  database replicas. Fewer than 50% of the total were available,
  so panic to ensure data integrity.

000002a1001313d0 md:mddb_commitrec_wrapper+84 (a, 3000065b7c0, 20, 0,
144bd88, 1492800)
   %l0-3: 0000000000000000 0000000000000001 000000000000000a
000003000027dbc8
   %l4-7: 00000000011e5ad8 00000000014bd800 0000000000000000
000000000144f818
000002a100131480 md_mirror:mirror_mark_resync_region+2b4 (3000032cd38,
11d0ce, 11d0cf, 3, 2, 3000032cd38)
   %l0-3: 0000000000000000 0000000000000010 0000000000000010
000003000032d078
   %l4-7: 0000000000000001 000003000032d080 0000000001438800
0000030000c3dd70
   %l0-3: 0000030000c3dd70 000003000032cd38
000003000005b8f8 0000000000000000
   %l4-7: 000003000032cd38 0000000000000000 0000000000000000
0000030000516ee8
000002a100131610 md:mdstrategy+d0 (30000516ee8, 3000065b7c0,
2a100131778, 0, 0, 1439400)
   %l0-3: 0000030000516ee8 ffffffffffffffff 0000030000516ee8
00000000016367b0
   %l4-7: 000003000000f8c8 0000000000002000 000002a750314000
000002a750314000
000002a1001316c0 genunix:bdev_strategy+90 (30000516ee8, 3000065b7c0, 20,
10, 4000, 400)
   %l0-3: 00000000011bc9d4 0000000000004000 0000000000000001
0000000000000500
   %l4-7: 00000700000389a0 0000000000000400 000002a100131850
000002a100131848
000002a100131790 ufs:ufs_putapage+384 (2a100131938, 2a100131930, 0,
2a100131930, 400, 30000936a78)
   %l0-3: 0000030000936be0 00000700000389a0 0000030000043348
0000000000000000
   %l4-7: 0000030000659508 0000000000000500 0000030000516ee8
00000700000389a0
000002a100131870 ufs:ufs_putpages+2e4 (0, 2000, 30000267f28, 400, 1, 400)
   %l0-3: 00000700000389a0 0000030000936b18 0000030000936a78
0000030000936bd8
   %l4-7: 0000000000000400 0000000000006000 0000000000004000
0000000000000000
000002a100131940 genunix:fsflush_do_pages+330 (30000936b18, 0, 0,
142b7f0, 17, 1)
   %l0-3: 0000000000004000 000000000119ca68 0000000000000000
00000000014444d8
   %l4-7: 00000000000000dd 0000000000000002 0000000000000000
000000000144f400
000002a100131a10 genunix:fsflush+3f4 (c, 149b360, 149b000, ef, 144bd88,
1492800)
   %l0-3: 000000000144f818 00000300000453d8 000003000027dbf8
000003000027dbc8
   %l4-7: 0000000000000bb8 0000000000000080 0000000000000000
000000000144f818

syncing file systems...WARNING: md: d11: write error on /dev/dsk/c0t0d0s0
WARNING: md: d12: write error on /dev/dsk/c0t1d0s0
  [1] 1WARNING: md: d11: write error on /dev/dsk/c0t0d0s0

------------------------------------------------------------------------------

salt(root): metadb
         flags first blk block count
      a m p luo 16 8192 /dev/dsk/c0t0d0s3
      a p luo 16 8192 /dev/dsk/c0t0d0s4
      a p luo 16 8192 /dev/dsk/c0t0d0s5
      a p luo 16 8192 /dev/dsk/c0t1d0s3
      a p luo 16 8192 /dev/dsk/c0t1d0s4
      a p luo 16 8192 /dev/dsk/c0t1d0s5

------------------------------------------------------------------------------

salt(root): metastat
d20: Mirror
     Submirror 0: d21
       State: Okay
     Submirror 1: d22
       State: Okay
     Pass: 1
     Read option: roundrobin (default)
     Write option: parallel (default)
     Size: 4198320 blocks (2.0 GB)

d21: Submirror of d20
     State: Okay
     Size: 4198320 blocks (2.0 GB)
     Stripe 0:
         Device Start Block Dbase State Reloc Hot Spare
         c0t0d0s1 0 No Okay Yes

d22: Submirror of d20
     State: Okay
     Size: 4198320 blocks (2.0 GB)
     Stripe 0:
         Device Start Block Dbase State Reloc Hot Spare
         c0t1d0s1 0 No Okay Yes

d10: Mirror
     Submirror 0: d11
       State: Okay
     Submirror 1: d12
       State: Resyncing
     Resync in progress: 0 % done
     Pass: 1
     Read option: roundrobin (default)
     Write option: parallel (default)
     Size: 71856960 blocks (34 GB)

d11: Submirror of d10
     State: Okay
     Size: 71856960 blocks (34 GB)
     Stripe 0:
         Device Start Block Dbase State Reloc Hot Spare
         c0t0d0s0 0 No Okay Yes

d12: Submirror of d10
     State: Resyncing
     Size: 71856960 blocks (34 GB)
     Stripe 0:
         Device Start Block Dbase State Reloc Hot Spare
         c0t1d0s0 0 No Okay Yes

Device Relocation Information:
Device Reloc Device ID
c0t1d0 Yes id1,dad@AST340014A=3JXBV7AX
c0t0d0 Yes id1,dad@AST340014A=3JXBSK9N
