Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c

From: Terry Lambert (tlambert2_at_mindspring.com)
Date: 06/19/03

    Date: Thu, 19 Jun 2003 01:23:11 -0700
    To: The Hermit Hacker <scrappy@hub.org>
    
    

    The Hermit Hacker wrote:
    > 'K, this kinda hurts ... there are a growing # of us that are actually
    > using unionfs and nullfs on production systems ... not small servers, but
    > several thousand processes with over 100 union mounts ... other than the
    > vnode leak stuff that David has been investigating, I've yet to see
    > anything that I would consider warranting the 'DO NOT USE / CAVEAT
    > EMPTOR' that is in the man pages ... :(

    Use mmap on a bunch of files on a nullfs, and don't call msync()
    to perform an explicit coherency cycle. Modify the original
    underlying files. Do this for different areas of partial pages
    on both the nullfs and the FS the nullfs is covering.

    1) There is no explicit coherency notification to the
            covering FS when the covered FS's vnode data is
            modified.

    2) There is no explicit coherency cycle for mapped pages
            when a write occurs, if the page being written is in
            core.
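The experiment described above can be sketched as a small user-space probe. This is only a minimal sketch: the function name `mapping_sees_write` and the 4096-byte probe length are assumptions made for illustration, and the two paths are meant to name the same file, once through the nullfs mount and once through the covered FS.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe (not part of any FreeBSD API).
 * Maps `mapped_path` MAP_SHARED, then writes part of the first page
 * through `written_path` with pwrite() and never calls msync().
 * Returns 1 if the mapping observes the write, 0 if it still sees
 * stale data, and -1 on a setup error.
 */
int mapping_sees_write(const char *mapped_path, const char *written_path)
{
    const size_t len = 4096;            /* assumed probe length */
    assert(mapped_path != NULL && written_path != NULL);

    int mfd = open(mapped_path, O_RDWR | O_CREAT, 0600);
    if (mfd < 0)
        return -1;
    if (ftruncate(mfd, (off_t)len) != 0) {
        close(mfd);
        return -1;
    }
    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (map == MAP_FAILED) {
        close(mfd);
        return -1;
    }
    memcpy(map, "old", 3);              /* touch the page so it is in core */

    int wfd = open(written_path, O_RDWR);
    if (wfd < 0) {
        munmap(map, len);
        close(mfd);
        return -1;
    }
    /* Partial-page write through the other path, at offset 100. */
    ssize_t n = pwrite(wfd, "new", 3, 100);
    int seen = (n == 3) ? (memcmp(map + 100, "new", 3) == 0) : -1;

    close(wfd);
    munmap(map, len);
    close(mfd);
    return seen;
}
```

Passing the same path twice exercises an ordinary FS with a unified buffer cache, where the result should be 1. The interesting runs pass the nullfs path for one argument and the underlying path for the other.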

    Basically, in order to support this, you will have to unmap the
    pages for write, take the fault, and then restart the write with
    the knowledge that you need to trigger a write-through (or a
    write-back) as a result of having triggered the fault: in other
    words, an explicit coherency cycle.
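As a rough user-space model of that cycle (not kernel code), the sketch below keeps a page read-only, takes a simulated fault on the first store, restarts the store, and then writes through to the backing copy. The `struct page`, `struct backing`, and `page_store` names and the 16-byte page size are all made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SZ 16          /* toy page size, for illustration only */

struct page {
    char data[TOY_PAGE_SZ];     /* in-core (mapped) copy */
    int  writable;              /* 0 = mapped read-only, stores must fault */
    int  faults;                /* number of write faults taken */
};

struct backing {
    char data[TOY_PAGE_SZ];     /* copy held by the lower layer */
    int  writes;                /* number of write-throughs received */
};

/* Simulated fault handler: record the fault and grant write access. */
static void write_fault(struct page *pg)
{
    pg->faults++;
    pg->writable = 1;
}

/*
 * Store into the page. If the page is write-protected, take the fault
 * first, then restart the store, write the modified bytes through to
 * the backing copy, and re-protect the page so that the next store
 * triggers another coherency cycle.
 */
void page_store(struct page *pg, struct backing *bk,
                size_t off, const char *src, size_t n)
{
    assert(off + n <= TOY_PAGE_SZ);
    if (!pg->writable)
        write_fault(pg);
    memcpy(pg->data + off, src, n);             /* the restarted write */
    memcpy(bk->data + off, pg->data + off, n);  /* write-through */
    bk->writes++;
    pg->writable = 0;                           /* unmap for write again */
}
```

A write-back variant would mark the page dirty in the fault handler and defer the copy to the backing store. The fault is still what tells the system that a coherency action is owed.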

    The current nullfs code avoids this by having a 1:1 page mapping
    and using a trick I came up with, which is to get the underlying
    vm_object_t from the underlying vnode, instead of the nullfs
    vnode. But it pays a rather large performance penalty.

    The other problem is that it gives the wrong impression about
    FS stacking in FreeBSD: it gives the impression that stacking
    works in cases other than the specialized, contrived case of nullfs.

    This does not (and can not) work with transformative stacking
    layers, such as a crypto stacking layer, a character set
    translation stacking layer (e.g. a Koi-8 FS NFS mounted on an
    ISO-8859-1 Locale system, which needs the Koi-8 data UTF-8
    encoded before it can be displayed in a file browser), and a
    number of other layers.

    The page trick suggested above also fails in some cases; for
    example, consider the case where you have a very fast disk
    for the first 2K of each file, and a slower disk for the
    remainder of each file (if any). The data break falls inside a
    page rather than on a page boundary, so a single page would need
    two backing stores, and therefore you can't deal with it.
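To make the arithmetic concrete, here is a small sketch that counts how many devices would sit behind a given page. It assumes a 4096-byte page and the 2K split from the example; the function name and both constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: 4K pages */
#define FAST_DISK_LIMIT   2048u   /* first 2K of each file on the fast disk */

/*
 * Returns how many distinct devices back page `pgidx` of a file that is
 * `file_len` bytes long: 0 if the page lies entirely past end of file,
 * 1 if it lives on a single disk, 2 if it straddles the fast/slow break.
 */
int devices_behind_page(size_t pgidx, size_t file_len)
{
    size_t start = pgidx * ASSUMED_PAGE_SIZE;
    if (start >= file_len)
        return 0;

    size_t end = start + ASSUMED_PAGE_SIZE;
    if (end > file_len)
        end = file_len;

    int on_fast = start < FAST_DISK_LIMIT;  /* page begins in the fast region */
    int on_slow = end > FAST_DISK_LIMIT;    /* page extends into the slow region */
    return on_fast + on_slow;
}
```

Under these assumptions, page 0 of any file longer than 2K comes back as 2, which is exactly the case a 1:1 page mapping onto a single underlying object cannot express.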

    In a similar vein, if you proxy your VOP descriptors to another
    address space, you are screwed, because vnodes are assumed to
    contain vm_object_t's, and these are assumed to be locally
    accessible to the address space in question (how do you implement
    a VOP_GETVOBJECT() when the vnode you are referencing is in user
    space? Is on another node? Etc.?).

    Paging VOPs almost need an internal payload of a page or page
    set, coupled with an address space descriptor, in order to let
    them know if the called party can access them directly, rather
    than needing to call a rendezvous data copy operation.

    If you read John Heidemann's Master's thesis (ftp.cs.ucla.edu),
    or the Ficus documentation (same FTP server), which are the
    basis of the stacking vnode framework in BSD4.4-Lite2, and thus
    in FreeBSD, you'll see that these problems already have
    answers; they just aren't being implemented in FreeBSD, and as
    FreeBSD moves further from the original intended design, it's
    only going to get harder to recover the functionality.

    Really, the stacking in FreeBSD today is pretty much a toy. The
    reason FFS can stack on UFS is that the VOPs that are being
    exported are not really stacked, because they represent two
    non-intersecting sets of VOPs: one is for a flat numeric namespace
    (inode numbers) FS, called UFS (or UFS2, or, formerly, MFS),
    and the upper layer FFS implements a hierarchical namespace
    in the context of the underlying flat numeric namespace.

    There are a couple of interesting things you can do without really
    stacking (causing the VOP namespaces to intersect, thus introducing
    the coherency issue); one of these would be to separate out the
    disk quota interface. With the exception of the quota VOP that's
    needed, everything else is non-intersecting in the same way that
    the nullfs is non-intersecting: there's no upper layer vm_object_t
    reference needed to implement it. Combine that with the VOP for
    the quota control operations being non-intersecting in the VOP
    namespace (like the VOP for directory operations not being in the
    UFS namespace), and you have sufficient separation to implement
    quotas in the context of a decoherent stacked cache, because you
    never need to reference both the upper and lower vnode's vm_object_t
    for a given particular vnode.

    But the FreeBSD implementation is probably far from useful, without
    the coherency notification mechanisms for "upper dirty/write through
    to lower" and "lower dirty/invalidate upper cached copy". Those just
    aren't there, and the framework totally lacks the necessary semantics
    for the second one, at the present time.
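The two notifications named above can be modeled with a toy transformative layer. This is only a sketch: the XOR "cipher", the `struct upper` / `struct lower` types, and the function names are invented for the illustration and do not correspond to any FreeBSD interface. Because the upper layer holds decoded data, it cannot share the lower layer's page and must keep its own cached copy, which is what makes the notifications necessary.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLK 8
#define TOY_KEY 0x5a    /* hypothetical XOR "cipher" for the upper layer */

struct lower {
    unsigned char data[TOY_BLK];    /* encoded data, as stored */
};

struct upper {
    struct lower *lo;
    unsigned char cache[TOY_BLK];   /* decoded copy, private to this layer */
    int valid;                      /* 0 = cache must be refilled from lower */
};

/* XOR every byte with the key; the same routine encodes and decodes. */
static void xor_copy(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < TOY_BLK; i++)
        dst[i] = src[i] ^ TOY_KEY;
}

/* "Upper dirty / write through to lower". */
void upper_write(struct upper *up, const unsigned char *plain)
{
    memcpy(up->cache, plain, TOY_BLK);
    up->valid = 1;
    xor_copy(up->lo->data, up->cache);
}

/*
 * "Lower dirty / invalidate upper cached copy". Passing up == NULL
 * models a framework with no way to deliver the notification.
 */
void lower_write(struct lower *lo, struct upper *up, const unsigned char *raw)
{
    memcpy(lo->data, raw, TOY_BLK);
    if (up != NULL)
        up->valid = 0;
}

/* Read through the upper layer, refilling the cache if it was invalidated. */
const unsigned char *upper_read(struct upper *up)
{
    if (!up->valid) {
        xor_copy(up->cache, up->lo->data);
        up->valid = 1;
    }
    return up->cache;
}
```

When `lower_write()` is called with a NULL upper pointer, a later `upper_read()` keeps returning the old decoded data, which is the decoherent case described in the text.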

    There are a number of deadlock issues in the unionfs case; most
    people don't use that, and use the union mount option, which is
    not the same thing at all. Most of these problems are centered
    around things like relookup, etc., which have to drop and then
    reacquire a lock to avoid an internal deadlock (e.g. "rename");
    by doing this, they open a small race window, in which it's
    possible, with the right call-path pressure, to create a deadlock
    between concurrently executing threads of control. The window
    is much more pronounced on SMP systems, which are statistically
    much more likely to hit it.

    Followups set to Freebsd-FS.

    -- Terry
    _______________________________________________
    freebsd-arch@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-arch
    To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

