Re: Options for synchronising filesystems

From: Eric Anderson (anderson_at_centtech.com)
Date: 09/26/05

    Date: Mon, 26 Sep 2005 07:46:09 -0500
    To: filip wuytack <filip@wuytack.net>
    
    

    filip wuytack wrote:
    >
    >
    > Eric Anderson wrote:
    >
    >> Brian Candler wrote:
    >>
    >>> Hello,
    >>>
    >>> I was wondering if anyone would care to share their experiences in
    >>> synchronising filesystems across a number of nodes in a cluster. I
    >>> can think
    >>> of a number of options, but before changing what I'm doing at the
    >>> moment I'd
    >>> like to see if anyone has good experiences with any of the others.
    >>>
    >>> The application: a clustered webserver. The users' CGIs run in a chroot
    >>> environment, and these clearly need to be identical (otherwise a CGI
    >>> running
    >>> on one box would behave differently when running on a different box).
    >>> Ultimately I'd like to synchronise the host OS on each server too.
    >>>
    >>> Note that this is a single-master, multiple-slave type of filesystem
    >>> synchronisation I'm interested in.
    >>>
    >>>
    >>> 1. Keep a master image on an admin box, and rsync it out to the frontends
    >>> -------------------------------------------------------------------------
    >>>
    >>>
    >>> This is what I'm doing at the moment. Install a master image in
    >>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
    >>> rsync it. [Actually I'm exporting it using NFS, and the frontends run
    >>> rsync
    >>> locally when required to update their local copies against the NFS
    >>> master]
    >>>
    >>> Disadvantages:
    >>>
    >>> - rsyncing a couple of gigs of data is not particularly fast, even
    >>> when only
    >>> a few files have changed
    >>>
    >>> - if a sysadmin (wrongly) changes a file on a front-end instead of on
    >>> the
    >>> master copy in the admin box, then the change will be lost when the next
    >>> rsync occurs. They might think they've fixed a problem, and then
    >>> (say) 24
    >>> hours later their change is wiped. However if this is a config file, the
    >>> fact that the old file has been reinstated might not be noticed until
    >>> the
    >>> daemon is restarted or the box rebooted - maybe months later. This I
    >>> think
    >>> is the biggest fundamental problem.
    >>>
    >>> - files can be added locally and they will remain indefinitely
    >>> (unless we
    >>> use rsync --delete which is a bit scary). If this is done then adding
    >>> a new
    >>> machine into the cluster by rsyncing from the master will not pick up
    >>> these
    >>> extra files.
    >>>
    >>> So, here are the alternatives I'm considering, and I'd welcome any
    >>> additional suggestions too.
    >>
    >>
    >>
    >> Here are a few ideas on this: do multiple rsyncs, one for each top level
    >> directory. That might speed up your total rsync process. Another
    >> similar method is using a content revisioning system. This is only
    >> good for some cases, but something like subversion might work ok here.
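
A minimal sketch of the "one rsync per top-level directory" idea, assuming the NFS master is mounted at a hypothetical /mnt/master/cgi and the local copy lives in /webroot/cgi. The function names and the worker count are illustrative only, and files sitting directly in the root of the tree would still need a separate pass.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_top_level_dirs(master_root):
    """Names of the directories directly under the master image."""
    return sorted(p.name for p in Path(master_root).iterdir() if p.is_dir())


def build_rsync_commands(master_root, dest_root, top_level_dirs):
    """Return one rsync argv list per top-level directory."""
    commands = []
    for name in top_level_dirs:
        # Trailing slashes make rsync copy the directory contents in place.
        src = f"{master_root.rstrip('/')}/{name}/"
        dst = f"{dest_root.rstrip('/')}/{name}/"
        commands.append(["rsync", "-a", src, dst])
    return commands


def run_in_parallel(commands, max_workers=4):
    """Run the commands concurrently; return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda argv: subprocess.run(argv).returncode,
                             commands))
```

Whether this is actually faster depends on where the time goes: if the bottleneck is the file-list scan on a single disk, parallel scans may not help much.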
    >>
    >>
    >>
    >>> 2. Run the images directly off NFS
    >>> ----------------------------------
    >>>
    >>> I've had this running before, even the entire O/S, and it works just
    >>> fine.
    >>> However the NFS server itself then becomes a critical
    >>> single-point-of-failure: if it has to be rebooted and is out of
    >>> service for
    >>> 2 minutes, then the whole cluster is out of service for that time.
    >>>
    >>> I think this is only feasible if I can build a highly-available NFS
    >>> server,
    >>> which really means a pair of boxes serving the same data. Since the
    >>> system
    >>> image is read-only from the point of view of the frontends, this
    >>> should be
    >>> easy enough:
    >>>
    >>>   frontends             frontends
    >>>     | | |                 | | |
    >>>      NFS   ----------->    NFS
    >>>   server 1     sync     server 2
    >>>
    >>> As far as I know, NFS clients don't support the idea of failing over
    >>> from
    >>> one server to another, so I'd have to make a server pair which
    >>> transparently
    >>> fails over.
    >>>
    >>> I could make one NFS server take over the other server's IP address
    >>> using
    >>> carp or vrrp. However, I suspect that the clients might notice. I
    >>> know that
    >>> NFS is 'stateless' in the sense that a server can be rebooted, but for a
    >>> client to be redirected from one server to the other, I expect that
    >>> these
    >>> filesystems would have to be *identical*, down to the level of the inode
    >>> numbers being the same.
    >>>
    >>> If that's true, then rsync between the two NFS servers won't cut it.
    >>> I was
    >>> thinking of perhaps using geom_mirror plus ggated/ggatec to make a
    >>> block-identical read-only mirror image on NFS server 2 - this also
    >>> has the
    >>> advantage that any updates are close to instantaneous.
    >>>
    >>> What worries me here is how NFS server 2, which has the mirrored
    >>> filesystem
    >>> mounted read-only, will take to having the data changed under its
    >>> nose. Does
    >>> it for example keep caches of inodes in memory, and what would happen if
    >>> those inodes on disk were to change? I guess I can always just
    >>> unmount and
    >>> remount the filesystem on NFS server 2 after each change.
    >>
    >>
    >>
    >> I've tried doing something similar. I used fiber attached storage,
    >> and had multiple hosts mounting the same partition. It seemed as
    >> though when host A mounted the filesystem read-write, and then host B
    >> mounted it read-only, any changes made by host A were not seen by B,
    >> and even remounting did not always bring it up to current state. I
    >> believe it has to do with the buffer cache and host A's desire to keep
    >> things (like inode changes, block maps, etc) in cache and not write
    >> them to disk. FreeBSD does not currently have a multi-system cache
    >> coherency protocol to distribute that information to other hosts.
    >> This is something I think would be very useful for many people. I
    >> suppose you could just mount the filesystem when you know a change has
    >> happened, but you still may not see the change. Maybe mounting the
    >> filesystem on host A with the sync option would help.
    >>
    >>> My other concern is about susceptibility to DoS-type attacks: if one
    >>> frontend were to go haywire and start hammering the NFS servers
    >>> really hard,
    >>> it could impact on all the other machines in the cluster.
    >>>
    >>> However, the problems of data synchronisation are solved: any change
    >>> made on
    >>> the NFS server is visible identically to all front-ends, and
    >>> sysadmins can't
    >>> make changes on the front-ends because the NFS export is read-only.
    >>
    >>
    >>
    >> This was my first thought too, and a highly available NFS server is
    >> something any NFS heavy installation wants (needs). There are a few
    >> implementations of clustered filesystems out there, but none for
    >> FreeBSD (yet). What that allows is multiple machines talking to a
    >> shared storage with read/write access. Very handy, but since you only
    >> need read-only access, I think your problem is much simpler, and you
    >> can get away with a lot less.
    >>
    >>
    >>> 3. Use a network distributed filesystem - CODA? AFS?
    >>> ----------------------------------------------------
    >>>
    >>> If each frontend were to access the filesystem as a read-only network
    >>> mount,
    >>> but have a local copy to work with in the case of disconnected
    >>> operation,
    >>> then the SPOF of an NFS server would be eliminated.
    >>>
    >>> However, I have no experience with CODA, and although it's been in
    >>> the tree
    >>> since 2002, the READMEs don't inspire confidence:
    >>>
    >>> "It is mostly working, but hasn't been run long enough to be sure
    >>> all the
    >>> bugs are sorted out. ... This code is not SMP ready"
    >>>
    >>> Also, a local cache is no good if the data you want during disconnected
    >>> operation is not in the cache at that time, which I think means this
    >>> idea is
    >>> not actually a very good one.
    >>
    >>
    >>
    >> There is also a port for coda. I've been reading about this, and
    >> it's an interesting filesystem, but I'm just not sure of its
    >> usefulness yet.
    >>
    >>
    >>> 4. Mount filesystems read-only
    >>> ------------------------------
    >>>
    >>> On each front-end I could store /webroot/cgi on a filesystem mounted
    >>> read-only to prevent tampering (as long as the sysadmin doesn't
    >>> remount it
    >>> read-write of course). That would work reasonably well, except that
    >>> being
    >>> mounted read-only I couldn't use rsync to update it!
    >>>
    >>> It might also work with geom_mirror and ggated/ggatec, except for the
    >>> issue
    >>> I raised before about changing blocks on a filesystem under the nose
    >>> of a
    >>> client who is actively reading from it.
    >>
    >>
    >>
    >> I suppose you could mount r/w only when doing the rsync, then switch
    >> back to ro once complete. You should be able to do this online,
    >> without any issues or taking the filesystem offline.
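
The remount-around-rsync idea could be sketched roughly as follows, assuming FreeBSD's `mount -u -o rw` / `mount -u -o ro` update syntax and a hypothetical master path. The `run` parameter is only there so the command sequence can be inspected without touching a real filesystem.

```python
import subprocess


def update_readonly_tree(source, mount_point, run=subprocess.check_call):
    """Remount read-write, rsync from source, then return to read-only."""
    src = source.rstrip("/") + "/"        # copy contents, not the directory
    dst = mount_point.rstrip("/") + "/"
    run(["mount", "-u", "-o", "rw", mount_point])
    try:
        run(["rsync", "-a", src, dst])
    finally:
        # Always attempt to go back to read-only, even if rsync failed.
        run(["mount", "-u", "-o", "ro", mount_point])


if __name__ == "__main__":
    # Hypothetical paths; requires root and a real mounted filesystem.
    update_readonly_tree("/mnt/master/cgi", "/webroot/cgi")
```

The `try`/`finally` matters here: without it, a failed rsync would leave the filesystem writable, which is the state this option is meant to avoid.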
    >>
    >>
    >>> 5. Using a filesystem which really is read-only
    >>> -----------------------------------------------
    >>>
    >>> Better tamper-protection could be had by keeping data in a filesystem
    >>> structure which doesn't support any updates at all - such as cd9660 or
    >>> geom_uzip.
    >>>
    >>> The issue here is how to roll out a new version of the data. I could
    >>> push
    >>> out a new filesystem image into a second partition, but it would then be
    >>> necessary to unmount the old filesystem and remount the new on the same
    >>> place, and you can't really unmount a filesystem which is in use. So
    >>> this
    >>> would require a reboot.
    >>>
    >>> I was thinking that some symlink trickery might help:
    >>>
    >>> /webroot/cgi -> /webroot/cgi1
    >>> /webroot/cgi1 # filesystem A mounted here
    >>> /webroot/cgi2 # filesystem B mounted here
    >>>
    >>> It should be possible to unmount /webroot/cgi2, dd in a new image,
    >>> remount
    >>> it, and change the symlink to point to /webroot/cgi2. After a little
    >>> while,
    >>> hopefully all the applications will stop using files in
    >>> /webroot/cgi1, so
    >>> this one can be unmounted and a new one put in its place on the next
    >>> update.
    >>> However this is not guaranteed, especially if there are long-lived
    >>> processes
    >>> using binary images in this partition. You'd still have to stop and
    >>> restart
    >>> all those processes.
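
The symlink switch described above can be made atomic by creating the new link under a temporary name and renaming it over the old one. This is a sketch only; the temporary-name scheme is an assumption, and it does nothing about processes that still hold files open under the old mount.

```python
import os


def switch_symlink(link_path, new_target):
    """Atomically repoint link_path (a symlink) at new_target."""
    tmp = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(new_target, tmp)
    try:
        # rename(2) replaces the old symlink itself in a single step,
        # so lookups see either the old target or the new one.
        os.replace(tmp, link_path)
    except OSError:
        os.unlink(tmp)
        raise


if __name__ == "__main__":
    # Paths taken from the layout above; needs write access to /webroot.
    switch_symlink("/webroot/cgi", "/webroot/cgi2")
```

Removing the old link and then creating a new one would leave a short window in which /webroot/cgi does not exist, which is what the rename avoids.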
    >>>
    >>> If reboots were acceptable, then the filesystem image could also be
    >>> stored
    >>> in ramdisk pulled in via pxeboot. This makes sense especially for
    >>> geom_uzip
    >>> where the data is pre-compressed. However I would still prefer to avoid
    >>> frequent reboots if at all possible. Also, whilst a ramdisk might be
    >>> OK for
    >>> the root filesystem, a typical CGI environment (with perl, php, ruby,
    >>> python, and loads of libraries) would probably be too large anyway.
    >>>
    >>>
    >>> 6. Journaling filesystem replication
    >>> ------------------------------------
    >>>
    >>> If the data were stored on a journaling filesystem on the master box,
    >>> and
    >>> the journal logs were distributed out to the slaves, then they would all
    >>> have identical filesystem copies and only a minimal amount of data would
    >>> need to be pushed out to each machine on each change. (This would be
    >>> rather
    >>> like NetApps and their snap-mirroring system). However I'm not aware
    >>> of any
    >>> journaling filesystem for FreeBSD, let alone whether it would support
    >>> filesystem replication in this way.
    >>
    >>
    >>
    >> There is a project underway for UFSJ (UFS journaling). Maybe once it
    >> is complete, and bugs are ironed out, one could implement a journal
    >> distribution piece to send the journal updates to multiple hosts and
    >> achieve what you are thinking. However, that would only distribute
    >> the metadata, not the actual data.
    >>
    >>
    > Have a look at DragonFly BSD for this. They are working on a journaling
    > filesystem that will do just that.

    Do you have a link to some information on this? I've been looking at
    Dragonfly, but I'm having trouble finding good information on what is
    already working, in planning, etc.

    Eric

    -- 
    ------------------------------------------------------------------------
    Eric Anderson        Sr. Systems Administrator        Centaur Technology
    Anything that works is better than anything that doesn't.
    ------------------------------------------------------------------------
    _______________________________________________
    freebsd-isp@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-isp
    To unsubscribe, send any mail to "freebsd-isp-unsubscribe@freebsd.org"
    
