Options for synchronising filesystems

From: Brian Candler (B.Candler_at_pobox.com)
Date: 09/24/05

    Date: Sat, 24 Sep 2005 15:10:25 +0100
    To: freebsd-cluster@freebsd.org, freebsd-isp@freebsd.org
    
    

    Hello,

    I was wondering if anyone would care to share their experiences in
    synchronising filesystems across a number of nodes in a cluster. I can think
    of a number of options, but before changing what I'm doing at the moment I'd
    like to see if anyone has good experiences with any of the others.

    The application: a clustered webserver. The users' CGIs run in a chroot
    environment, and these clearly need to be identical (otherwise a CGI running
    on one box would behave differently when running on a different box).
    Ultimately I'd like to synchronise the host OS on each server too.

    Note that it is single-master, multiple-slave filesystem synchronisation
    that I'm interested in.

    1. Keep a master image on an admin box, and rsync it out to the frontends
    -------------------------------------------------------------------------

    This is what I'm doing at the moment. Install a master image in
    /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
    rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync
    locally when required to update their local copies against the NFS master]

    Disadvantages:

    - rsyncing a couple of gigs of data is not particularly fast, even when only
    a few files have changed

    - if a sysadmin (wrongly) changes a file on a front-end instead of on the
    master copy in the admin box, then the change will be lost when the next
    rsync occurs. They might think they've fixed a problem, and then (say) 24
    hours later their change is wiped. However if this is a config file, the
    fact that the old file has been reinstated might not be noticed until the
    daemon is restarted or the box rebooted - maybe months later. This I think
    is the biggest fundamental problem.

    - files can be added locally and they will remain indefinitely (unless we
    use rsync --delete which is a bit scary). A new machine added to the
    cluster by rsyncing from the master will then not pick up these extra
    files, so it ends up different from the existing frontends.
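
    The last two disadvantages (local edits and stray local files) can at
    least be detected before the next sync. A minimal sketch, assuming the
    master image and a frontend copy are both reachable as directories; the
    function name report_drift and the demo paths are made up for
    illustration. It compares file contents only, not ownership or
    permissions.

```shell
#!/bin/sh
# report_drift MASTER FRONTEND
# Lists files whose contents differ between the two trees, and files that
# exist in only one of them. Exit status follows diff(1):
# 0 = trees identical, 1 = drift found, >1 = error.
report_drift() {
    LC_ALL=C diff -rq "$1" "$2"
}

# Demo on throwaway directories (hypothetical stand-ins for the real trees).
demo=$(mktemp -d "${TMPDIR:-/tmp}/drift.XXXXXX")
mkdir -p "$demo/master/etc" "$demo/frontend/etc"
echo 'Listen 80'   > "$demo/master/etc/httpd.conf"
echo 'Listen 8080' > "$demo/frontend/etc/httpd.conf"   # edited on the wrong box
echo 'stray'       > "$demo/frontend/etc/local.conf"   # added locally
report_drift "$demo/master" "$demo/frontend" || echo "drift detected"
```

    Running rsync itself with --dry-run, --itemize-changes and --delete
    should give a closer preview of what the real sync would overwrite or
    remove, since it uses the same comparison rules as the actual transfer.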

    So, here are the alternatives I'm considering, and I'd welcome any
    additional suggestions too.

    2. Run the images directly off NFS
    ----------------------------------

    I've had this running before, even the entire O/S, and it works just fine.
    However the NFS server itself then becomes a critical
    single-point-of-failure: if it has to be rebooted and is out of service for
    2 minutes, then the whole cluster is out of service for that time.

    I think this is only feasible if I can build a highly-available NFS server,
    which really means a pair of boxes serving the same data. Since the system
    image is read-only from the point of view of the frontends, this should be
    easy enough:

          frontends            frontends
            | | |                | | |
             NFS  ------------>   NFS
           server 1    sync     server 2

    As far as I know, NFS clients don't support the idea of failing over from
    one server to another, so I'd have to make a server pair which transparently
    fails over.

    I could make one NFS server take over the other server's IP address using
    CARP or VRRP. However, I suspect that the clients might notice. I know that
    NFS is 'stateless' in the sense that a server can be rebooted, but for a
    client to be redirected from one server to the other, I expect that these
    filesystems would have to be *identical*, down to the level of the inode
    numbers being the same.

    If that's true, then rsync between the two NFS servers won't cut it. I was
    thinking of perhaps using geom_mirror plus ggated/ggatec to make a
    block-identical read-only mirror image on NFS server 2 - this also has the
    advantage that any updates are close to instantaneous.

    What worries me here is how NFS server 2, which has the mirrored filesystem
    mounted read-only, will take to having the data changed under its nose. Does
    it for example keep caches of inodes in memory, and what would happen if
    those inodes on disk were to change? I guess I can always just unmount and
    remount the filesystem on NFS server 2 after each change.

    My other concern is about susceptibility to DoS-type attacks: if one
    frontend were to go haywire and start hammering the NFS servers really hard,
    it could impact on all the other machines in the cluster.

    However, the problems of data synchronisation are solved: any change made on
    the NFS server is visible identically to all front-ends, and sysadmins can't
    make changes on the front-ends because the NFS export is read-only.

    3. Use a network distributed filesystem - CODA? AFS?
    ----------------------------------------------------

    If each frontend were to access the filesystem as a read-only network mount,
    but have a local copy to work with in the case of disconnected operation,
    then the SPOF of an NFS server would be eliminated.

    However, I have no experience with CODA, and although it's been in the tree
    since 2002, the READMEs don't inspire confidence:

       "It is mostly working, but hasn't been run long enough to be sure all the
       bugs are sorted out. ... This code is not SMP ready"

    Also, a local cache is no good if the data you want during disconnected
    operation is not in the cache at that time, which I think means this idea is
    not actually a very good one.

    4. Mount filesystems read-only
    ------------------------------

    On each front-end I could store /webroot/cgi on a filesystem mounted
    read-only to prevent tampering (as long as the sysadmin doesn't remount it
    read-write of course). That would work reasonably well, except that with
    the filesystem mounted read-only I couldn't use rsync to update it!

    It might also work with geom_mirror and ggated/ggatec, except for the issue
    I raised before about changing blocks on a filesystem under the nose of a
    client who is actively reading from it.

    5. Using a filesystem which really is read-only
    -----------------------------------------------

    Better tamper-protection could be had by keeping data in a filesystem
    structure which doesn't support any updates at all - such as cd9660 or
    geom_uzip.

    The issue here is how to roll out a new version of the data. I could push
    out a new filesystem image into a second partition, but it would then be
    necessary to unmount the old filesystem and mount the new one in the same
    place, and you can't really unmount a filesystem which is in use. So this
    would require a reboot.

    I was thinking that some symlink trickery might help:

        /webroot/cgi -> /webroot/cgi1
        /webroot/cgi1 # filesystem A mounted here
        /webroot/cgi2 # filesystem B mounted here

    It should be possible to unmount /webroot/cgi2, dd in a new image, remount
    it, and change the symlink to point to /webroot/cgi2. After a little while,
    hopefully all the applications will stop using files in /webroot/cgi1, so
    this one can be unmounted and a new one put in its place on the next update.
    However this is not guaranteed, especially if there are long-lived processes
    using binary images in this partition. You'd still have to stop and restart
    all those processes.
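
    The symlink switch itself might look like the sketch below. It uses
    plain directories as stand-ins for the two mounted images, so the
    unmount, dd and remount steps are left out; switch_image and the demo
    paths are hypothetical names.

```shell
#!/bin/sh
# switch_image LINK TARGET
# Repoints the symlink LINK at TARGET (for example cgi -> cgi2).
# -n stops ln from following LINK when it already points at a directory,
# which would otherwise create the new link inside the old image.
switch_image() {
    ln -sfn "$2" "$1"
}

# Demo: two directories stand in for filesystems A and B.
demo=$(mktemp -d "${TMPDIR:-/tmp}/swap.XXXXXX")
mkdir "$demo/cgi1" "$demo/cgi2"
echo old > "$demo/cgi1/version"
echo new > "$demo/cgi2/version"
switch_image "$demo/cgi" cgi1      # initial state: cgi -> cgi1
switch_image "$demo/cgi" cgi2      # roll out:      cgi -> cgi2
cat "$demo/cgi/version"
```

    Whether the replacement is atomic depends on the ln implementation; if
    it removes the old link before creating the new one, a lookup of
    /webroot/cgi in that instant could fail. Creating the new symlink under
    a temporary name and renaming it over the old one would close that
    window. Processes that already have files open under the old image are
    unaffected either way, which is why the old filesystem may stay busy.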

    If reboots were acceptable, then the filesystem image could also be stored
    in ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip
    where the data is pre-compressed. However I would still prefer to avoid
    frequent reboots if at all possible. Also, whilst a ramdisk might be OK for
    the root filesystem, a typical CGI environment (with perl, php, ruby,
    python, and loads of libraries) would probably be too large anyway.

    6. Journaling filesystem replication
    ------------------------------------

    If the data were stored on a journaling filesystem on the master box, and
    the journal logs were distributed out to the slaves, then they would all
    have identical filesystem copies and only a minimal amount of data would
    need to be pushed out to each machine on each change. (This would be rather
    like NetApp's SnapMirror system.) However I'm not aware of any
    journaling filesystem for FreeBSD, let alone whether it would support
    filesystem replication in this way.
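
    To illustrate the idea of shipping a log rather than the data, here is
    a toy file-level version. It is not a journaling filesystem and works
    on whole files, not blocks; the record format, the .applied marker file
    and the name replay_journal are all invented for this sketch. The
    master appends numbered records to a journal, and each slave replays
    only the records it has not yet applied.

```shell
#!/bin/sh
# Journal format (hypothetical), one record per line:
#   <seq> PUT <path> <content>
#   <seq> DEL <path>
#
# replay_journal JOURNAL ROOT
# Applies every record whose sequence number is greater than the one stored
# in ROOT/.applied, then stores the last sequence number it has seen.
replay_journal() {
    journal=$1
    root=$2
    last=0
    [ -f "$root/.applied" ] && last=$(cat "$root/.applied")
    while read -r seq op path content; do
        [ "$seq" -gt "$last" ] || continue      # already applied, skip
        case $op in
            PUT) mkdir -p "$(dirname "$root/$path")"
                 printf '%s\n' "$content" > "$root/$path" ;;
            DEL) rm -f "$root/$path" ;;
        esac
        last=$seq
    done < "$journal"
    echo "$last" > "$root/.applied"
}

# Demo: the master writes two records, the slave replays them, then the
# master appends two more and the slave catches up.
demo=$(mktemp -d "${TMPDIR:-/tmp}/journal.XXXXXX")
mkdir "$demo/slave"
printf '%s\n' '1 PUT etc/motd hello' '2 PUT bin/tool v1' > "$demo/journal"
replay_journal "$demo/journal" "$demo/slave"
printf '%s\n' '3 DEL etc/motd' '4 PUT bin/tool v2' >> "$demo/journal"
replay_journal "$demo/journal" "$demo/slave"    # applies only records 3 and 4
cat "$demo/slave/bin/tool"
```

    Because every slave applies the same records in the same order, the
    copies converge on the same file contents, and a slave that was offline
    only needs the records after its last applied sequence number.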

    Well, that's what I've come up with so far. I'd be very interested to hear
    if people have any other strategies or suggestions, particularly with
    practical experience in a clustered/ISP environment.

    Regards,

    Brian Candler.
    _______________________________________________
    freebsd-isp@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-isp
    To unsubscribe, send any mail to "freebsd-isp-unsubscribe@freebsd.org"

