SUMMARY: device database locked by cluster member

From: Mike Broderick (broderic_at_MIT.EDU)
Date: 04/29/03

    Date: Mon, 28 Apr 2003 18:08:48 -0400
    To: tru64-unix-managers <tru64-unix-managers@ornl.gov>
    
    

    Thanks to Dr Thomas Blinn (@HP) for the only answer, which confirmed
    that a reboot was the best solution short of hacking inside the running
    kernel with uncertain consequences, something I certainly did not feel
    knowledgeable enough to do. A reboot did clear up the problem.

    He also suggested I could force the system to panic and/or crash and
    analyze the dump. His reply is attached below, as is my original post.
                                             _Mike

    Almost certainly reboot city -- you have things locked up inside of the
    kernel, probably one of the (seemingly many) interactions between the
    CAM (mass storage) subsystem and hardware management. There are no
    handy locks you can poke. If you knew enough about the internals of
    the kernel (I don't, and I'm now the DRI for "hwmgr" stuff), you just
    might be able to figure out why the system doesn't want to allow you
    to kill the two commands (there are some places in the internal HWC
    code where you might wind up sleeping uninterruptibly because one or
    more of the hardware management databases inside the kernel is in an
    inconsistent state, something that suggests some other thread is off
    changing something, and the code sleeps and then wakes up later and
    re-checks, but of course, if there's a bug...). And you might even
    be able to tweak things so that they would progress, but then you'd
    probably get a panic anyway.

    If you've got a support contract, and you know how to force a halt
    and get a crash dump (or how to force a panic: one method is to get
    into the kernel debugger dbx and set the global variable "hz" to
    zero; since it's used in divisions all the time, you quickly die
    with a divide by zero, and you get a crash dump), then you could
    send the crash dump off to the CSC and they might be able to
    identify a known or new problem.

    There are newer patch kits (I think) than the one you've got, so it
    is possible the problem you're seeing has been fixed.

    A reboot does get things starting from a "clean slate" and depending
    on just how much of the scanning and deleting actually got done, it
    might even make the device either usable or gone.

    Tom

    Mike Broderick wrote:

    > One more seemingly important thing. This is a standalone (5.1a+pk1)
    > system (no cluster). _Mike
    >
    >
    > -------- Original Message --------
    > Subject: device database locked by cluster member
    > Date: Thu, 24 Apr 2003 16:59:24 -0400
    > From: Mike Broderick <broderic@mit.edu>
    > To: tru64-unix-managers <tru64-unix-managers@ornl.gov>
    >
    >
    >
    > I get this message trying to access the device db:
    >
    > # dsfmgr -s
    > dsfmgr: NOTE: waiting for Session Lock held by member #0. At Thu Apr 24 16:53:41 2003
    > ^C
    > #
    >
    > We were trying to clean up an old device earlier but these two hwmgr
    > commands just hung (not kill-able):
    >
    > # ps -ef | grep hwmgr | grep -v grep
    > root 397765 397719 0.0 15:53:02 pts/1 0:00.10 hwmgr sc sc
    > root 398377 397758 0.0 16:00:48 pts/2 0:00.04 hwmgr delete scsi -did 17
    > #
    >
    > The device being deleted above is in a strange state:
    >
    > # hwmgr sh sc | grep 17
    > 109: 17 pine tape none 0 1 tape113
    > # hwmgr sh sc -id 109 -full
    >
    > SCSI DEVICE DEVICE DRIVER NUM DEVICE FIRST
    > HWID: DEVICEID HOSTNAME TYPE SUBTYPE OWNER PATH FILE VALID PATH
    > -------------------------------------------------------------------------
    > 109: 17 pine tape none 0 1 tape113
    >
    > WWID:06100036:"QUANTUM DLT7000 :d01l00034:1000-00e0-0201-a2d1"
    >
    >
    > BUS TARGET LUN PATH STATE
    > ------------------------------
    > 5 8 34 stale
    > # hwmgr sh comp -id 109 -full
    >
    > HWID: HOSTNAME FLAGS SERVICE COMPONENT NAME
    > -----------------------------------------------
    > 109: pine rcd-i iomap SCSI-WWID:06100036:"QUANTUM DLT7000 :d01l00034:1000-00e0-0201-a2d1"
    >
    > DSF GROUP
    > INSTANCE GRPFLAGS GROUPID SUBSYSTEM BASENAME L1 L2
    > ---------------------------------------------------------
    > 0 40 54 cam_tape tape113 tape (null)
    >
    > DEVICE NODE
    > ID LBdevT LCdevT CBdevT CCdevT BFlags CFlags Class Suffix L3B L3C
    > -------------------------------------------------------------------------------
    > 16 0 330045e 0 1300307 0x0 0x861 0x0 . . .
    > 15 0 330044f 0 130031a 0x0 0x861 0x0 _d7 (null) norewind
    >
    > COMPONENT INCONSISTENCY
    > -----------------------
    > Cluster shared component has no entry in the cluster database.
    >
    >
    >
    > How can I clear this up w/o rebooting? Is there a lock file or
    > something somewhere I can delete?
    >
    > _Mike
    >
    >
    >
    >

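Tom's reply describes code that finds a hardware management database in an inconsistent state, sleeps, wakes up later, and re-checks. The sketch below is a user-space analogy of that pattern in Python, not the Tru64 kernel code. The class `HardwareDb`, its methods, and the `max_rechecks` escape hatch are all hypothetical names invented for illustration. It shows why a waiter never makes progress if the updating thread never clears the flag.

```python
import threading


class HardwareDb:
    """Hypothetical stand-in for a shared database guarded by a busy flag."""

    def __init__(self):
        self._cond = threading.Condition()
        self.inconsistent = False

    def begin_update(self):
        # An updater marks the database inconsistent while it changes things.
        with self._cond:
            self.inconsistent = True

    def end_update(self):
        # A well-behaved updater clears the flag and wakes any waiters.
        with self._cond:
            self.inconsistent = False
            self._cond.notify_all()

    def wait_until_consistent(self, recheck_interval=0.01, max_rechecks=None):
        """Sleep, wake, and re-check until the flag clears.

        Returns True once the database is consistent. Returns False only if
        max_rechecks is set and runs out. With max_rechecks=None the loop
        never gives up, which is the hang described in the reply.
        """
        rechecks = 0
        with self._cond:
            while self.inconsistent:
                if max_rechecks is not None and rechecks >= max_rechecks:
                    return False
                self._cond.wait(timeout=recheck_interval)
                rechecks += 1
            return True


# Normal case: the updater finishes shortly after the waiter starts waiting.
db = HardwareDb()
db.begin_update()
updater = threading.Timer(0.05, db.end_update)
updater.start()
finished = db.wait_until_consistent()
updater.join()

# Buggy case: the updater never calls end_update, so the waiter only
# returns because this sketch bounds the number of re-checks.
stuck = HardwareDb()
stuck.begin_update()
gave_up = not stuck.wait_until_consistent(max_rechecks=5)
```

The bounded re-check count exists only so the example terminates. The hung `hwmgr` commands in the original post had no such bound and could not be killed.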

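The original post spots the problem device by reading the path table in the `hwmgr sh sc -id 109 -full` output by eye. As a rough sketch, the same check can be done on saved output with a few lines of Python. The function name `find_stale_paths` is hypothetical, and the parser assumes the table looks exactly like the one quoted in the post (a `BUS TARGET LUN PATH STATE` header, a dashed rule, then one row per path). It does not run `hwmgr` itself.

```python
def find_stale_paths(report):
    """Return (bus, target, lun) tuples for rows whose PATH STATE is 'stale'.

    `report` is the saved text of the command output. Leading '>' quote
    markers, as in the mailing-list post, are tolerated.
    """
    stale = []
    in_table = False
    for raw in report.splitlines():
        line = raw.lstrip("> ").strip()
        fields = line.split()
        if fields[:3] == ["BUS", "TARGET", "LUN"]:
            in_table = True          # header row found; data rows follow
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # a blank line ends the table
            continue
        if set(line) == {"-"}:
            continue                 # dashed rule under the header
        if len(fields) == 4 and all(f.isdigit() for f in fields[:3]):
            if fields[3].lower() == "stale":
                stale.append((int(fields[0]), int(fields[1]), int(fields[2])))
        else:
            in_table = False         # anything else means the table is over
    return stale


# Path table copied from the original post.
SAMPLE = """\
> BUS TARGET LUN PATH STATE
> ------------------------------
> 5 8 34 stale
"""

stale_paths = find_stale_paths(SAMPLE)
```

For the output in the post this yields the single path at bus 5, target 8, LUN 34.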