Re: Cleaning up vgone.

From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: 03/10/05

  • Next message: Jeff Roberson: "Re: Cleaning up vgone."
    Date: Thu, 10 Mar 2005 02:26:34 -0800 (PST)
    To: Jeff Roberson <jroberson@chesapeake.net>
    
    

    :I've run into a few vclean races and some related problems with VOP_CLOSE
    :not using locks. I've made some fairly major changes to the way vfs
    :handles vnode teardown in the process of fixing this. I'll summarize what
    :I've done here.
    :
    :The main problem with teardown was the two stage locking scheme involving
    :the XLOCK. I got rid of the XLOCK and simply require the vnode lock
    :throughout the whole operation. To accommodate this, VOP_INACTIVE,
    :VOP_RECLAIM, VOP_CLOSE, and VOP_REVOKE all require the vnode lock. As
    :does vgone().
    :
    :Prior to this, vgone() would set XLOCK and then do a LK_DRAIN to make
    :sure there were no callers waiting in VOP_LOCK so that they would always
    :see the VI_XLOCK and know that the vnode had changed identities. Now,
    :vgone sets XLOCK, and all lockers who use vget() and vn_lock() check for
    :VI_DOOMED before and after acquiring the vnode lock. To wait for the
    :transition to complete, you simply wait on the vnode lock.
    :
    :This really only required minor changes of the filesystems in the tree.
    :Most only required the removal of a VOP_UNLOCK in VOP_INACTIVE, and a few
    :acquired the lock in VOP_CLOSE to do operations which they otherwise could
    :not. There is one change to ffs and coda which inspect v_data in their
    :vop_lock routines. This is only safe with the interlock held, where
    :before the XLOCK would have protected v_data in the one case that could
    :lead to panic.
    :
    :The patch is available at http://www.chesapeake.net/~jroberson/vgone.diff
    :
    :Cheers,
    :Jeff

        I did basically the same thing in DragonFly. I can only caution that
        you have to be very, *very* careful to ensure that there are no races
        between entities trying to ffs_vget() through the inode verses the
        related vnode undergoing a vgone(). Be aware of the fact that there
        is a major non-atomicy problem between the destruction of the vnode
        and the clearing of the inode's bit in the bitmap and the destruction
        of the inode in the inode hash table.

        In DragonFly there were races in the inode-reuse case where an inode
        would be reused before it was entirely disconnected from the vnode.
        This case can occur because the inode's bitmap bit is cleared before
        the vnode is unlocked (and you have to do it that way, you can't clear
        the inode's bitmap bit after the vnode has been unlocked without creating
        yet more issues).

        This means that a totally unrelated process creating a file or directory
        could allocate the SAME inode number out of the inode bitmap and call
        FFS_VGET() *BEFORE* the original vnode controlling that inode number
        finishes being reclaimed. In fact, the original inode is still pointing
        at the vnode undergoing destruction and the vnode's v_data is still
        pointing at the inode. The result: bad things happen.

        It got so hairy in DFly that I wound up reordering the code as follows:

            (1) lock vnode
            (2) free the inode in the bitmap (UFS_IFREE() equivalent)
            (3) Set v_data to NULL
            (4) Remove inode from the inode hash table (clear back pointer)
                    [inode now fully disassociated]
            (5) unlock vnode.

            *** Modify the bitmap allocation code for inodes to check whether the
                candidate inode is still present in the inode hash and SKIP THAT
                INODE if it is. (This in the UFS code).

        If you do not do that you wind up with a case where the filesystem cannot
        differentiate between someone trying to lock the original vnode and
        someone trying to vget() the vnode related to a newly created file whos
        inode was just allocated out of the inode bitmap.

        The word 'nasty' doesn't even begin to describe the problem. There are
        *THREE* races here: The vnode lock, the inode hash table, AND the
        inode bitmap. Due to disk I/O it is possible for the system to block
        in between any of the related operations so you have to make sure that
        races against all three points are handled properly. My solution was
        to end-run around the points by leaving the inode hash table intact until
        the very end. In DragonFly I don't call ufs_ihashrem() until the inode
        has been completely divorced from the vnode and that gives me a way
        to check for the race (by checking whether the inode is present in the
        inode hash table or not).

        Another thing I did in DragonFly was to get rid of the crazy handling of
        the vnode's reference count during the reclaim. I don't know what
        FreeBSD-HEAD is doing now, but FreeBSD-4 dropped the ref count to 0
        and then did the crazy VXLOCK stuff. In DragonFly I changed that so
        the ref count is left at 1 (not 0) during the reclaim. This required
        fixing up a few cases that were checking the refcount against an absolute
        0 or 1 (I forget the cases), but it made the code a whole lot easier to
        understand.

                                            -Matt
                                            Matthew Dillon
                                            <dillon@backplane.com>
    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"


  • Next message: Jeff Roberson: "Re: Cleaning up vgone."

    Relevant Pages

    • Re: Cleaning up vgone.
      ... I got rid of the XLOCK and simply require the vnode lock ... you simply wait on the vnode lock. ... > of the inode in the inode hash table. ... > inode was just allocated out of the inode bitmap. ...
      (freebsd-arch)
    • Re: review of ocfs2
      ... I'll make sure that gets noted in the proper documentation. ... Well the side effect here is that it grabbed the lock in which case we want ... The long story is that the truncate log inode always gets locked ...
      (Linux-Kernel)
    • Re: [PATCH] prune_icache_sb
      ... In Linux a filesystem is a dumb layer which sits between the VFS and the ... inode (that this cluster lock is created for). ...
      (Linux-Kernel)
    • Re: ffs snapshot lockup
      ... If the buffer lock has waiters after the buffer has changed identity then ... struct buf_queue_head { ... retrieving revision 1.106 ... * is currently suspending/suspended and vnode has been accessed. ...
      (freebsd-stable)
    • Re: [PATCH] Make udf(4) MPSAFE and use shared lookups
      ... Set VV_ROOT in udf_vgetif we ever return a vnode instead of doing ... Allow lock recursion. ... Tracing pid 977 tid 100087 td 0xc43b4d80 ... lock type udf: EXCL by thread 0xc43b4d80 ...
      (freebsd-current)