Re: Options for synchronising filesystems

From: filip wuytack (filip_at_wuytack.net)
Date: 09/26/05

    Date: Mon, 26 Sep 2005 13:25:11 +0100
    To: Eric Anderson <anderson@centtech.com>
    
    

    Eric Anderson wrote:
    > Brian Candler wrote:
    >
    >> Hello,
    >>
    >> I was wondering if anyone would care to share their experiences in
    >> synchronising filesystems across a number of nodes in a cluster. I can
    >> think of a number of options, but before changing what I'm doing at the
    >> moment I'd like to see if anyone has good experiences with any of the
    >> others.
    >>
    >> The application: a clustered webserver. The users' CGIs run in a chroot
    >> environment, and these clearly need to be identical (otherwise a CGI
    >> running on one box would behave differently when running on a different
    >> box). Ultimately I'd like to synchronise the host OS on each server too.
    >>
    >> Note that this is a single-master, multiple-slave type of filesystem
    >> synchronisation I'm interested in.
    >>
    >>
    >> 1. Keep a master image on an admin box, and rsync it out to the frontends
    >> -------------------------------------------------------------------------
    >>
    >> This is what I'm doing at the moment. Install a master image in
    >> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
    >> rsync it. [Actually I'm exporting it using NFS, and the frontends run
    >> rsync locally when required to update their local copies against the
    >> NFS master]
    >>
    >> Disadvantages:
    >>
    >> - rsyncing a couple of gigs of data is not particularly fast, even
    >> when only a few files have changed
    >>
    >> - if a sysadmin (wrongly) changes a file on a front-end instead of on the
    >> master copy in the admin box, then the change will be lost when the next
    >> rsync occurs. They might think they've fixed a problem, and then (say) 24
    >> hours later their change is wiped. However if this is a config file, the
    >> fact that the old file has been reinstated might not be noticed until the
    >> daemon is restarted or the box rebooted - maybe months later. This I
    >> think is the biggest fundamental problem.
    >>
    >> - files can be added locally and they will remain indefinitely (unless we
    >> use rsync --delete which is a bit scary). If this is done then adding
    >> a new machine into the cluster by rsyncing from the master will not pick
    >> up these extra files.
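One way to surface the two problems just listed (local edits that the next rsync will wipe, and locally added files that the master does not know about) is to compare checksum manifests before syncing. The following is only a sketch in Python, not something proposed in the thread; the function names are hypothetical, and it assumes both trees are readable from one place, for example the NFS-exported master and the frontend's local copy.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # symlinks and special files are skipped in this sketch
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def report_drift(master, frontend):
    """Compare two manifests and return (modified, extra) sorted path lists.

    modified: present on both sides but with different content.
    extra:    present only on the frontend.
    """
    modified = sorted(
        path for path in frontend
        if path in master and frontend[path] != master[path]
    )
    extra = sorted(path for path in frontend if path not in master)
    return modified, extra
```

Running report_drift(build_manifest(master_path), build_manifest(frontend_path)) before each sync and mailing any non-empty result to the admins would at least make a lost local change visible at the time it is lost, rather than months later.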
    >>
    >> So, here are the alternatives I'm considering, and I'd welcome any
    >> additional suggestions too.
    >
    >
    > Here are a few ideas on this: do multiple rsyncs, one for each top-level
    > directory. That might speed up your total rsync process. Another
    > similar method is using a content revisioning system. This is only good
    > for some cases, but something like Subversion might work OK here.
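A minimal sketch of the one-rsync-per-top-level-directory idea, in Python. It assumes the master image is visible at a local path and that rsync is on the PATH; the function names and the worker count are illustrative and not from the thread. Files sitting directly in the top level are not covered here and would need one extra pass. Whether running the transfers side by side actually helps depends on where the bottleneck is (disk seeks, network, or CPU).

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_rsync_commands(master_root, dest_root):
    """Return one rsync argv list per top-level directory under master_root."""
    commands = []
    for name in sorted(os.listdir(master_root)):
        src = os.path.join(master_root, name)
        if os.path.isdir(src) and not os.path.islink(src):
            # The trailing slash on the source makes rsync copy the
            # directory's contents into dest_root/name/ instead of
            # creating dest_root/name/name.
            commands.append(
                ["rsync", "-a", src + "/", os.path.join(dest_root, name) + "/"]
            )
    return commands


def run_parallel(commands, workers=4):
    """Run the commands concurrently; return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(subprocess.call, commands))
```

Building the argument lists separately from running them makes it easy to print and inspect the commands first, which seems prudent for anything that rewrites a live tree.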
    >
    >
    >
    >> 2. Run the images directly off NFS
    >> ----------------------------------
    >>
    >> I've had this running before, even the entire O/S, and it works just
    >> fine. However the NFS server itself then becomes a critical
    >> single-point-of-failure: if it has to be rebooted and is out of
    >> service for 2 minutes, then the whole cluster is out of service for
    >> that time.
    >>
    >> I think this is only feasible if I can build a highly-available NFS
    >> server, which really means a pair of boxes serving the same data. Since
    >> the system image is read-only from the point of view of the frontends,
    >> this should be easy enough:
    >>
    >>   frontends                frontends
    >>     | | |                    | | |
    >>      NFS    ----------->      NFS
    >>    server 1     sync       server 2
    >>
    >> As far as I know, NFS clients don't support the idea of failing over from
    >> one server to another, so I'd have to make a server pair which
    >> transparently fails over.
    >>
    >> I could make one NFS server take over the other server's IP address using
    >> carp or vrrp. However, I suspect that the clients might notice. I know
    >> that NFS is 'stateless' in the sense that a server can be rebooted, but
    >> for a client to be redirected from one server to the other, I expect that
    >> these filesystems would have to be *identical*, down to the level of the
    >> inode numbers being the same.
    >>
    >> If that's true, then rsync between the two NFS servers won't cut it. I
    >> was thinking of perhaps using geom_mirror plus ggated/ggatec to make a
    >> block-identical read-only mirror image on NFS server 2 - this also has
    >> the advantage that any updates are close to instantaneous.
    >>
    >> What worries me here is how NFS server 2, which has the mirrored
    >> filesystem mounted read-only, will take to having the data changed under
    >> its nose. Does it for example keep caches of inodes in memory, and what
    >> would happen if those inodes on disk were to change? I guess I can always
    >> just unmount and remount the filesystem on NFS server 2 after each change.
    >
    >
    > I've tried doing something similar. I used fiber attached storage, and
    > had multiple hosts mounting the same partition. It seemed as though
    > when host A mounted the filesystem read-write, and then host B mounted
    > it read-only, any changes made by host A were not seen by B, and even
    > remounting did not always bring it up to current state. I believe it
    > has to do with the buffer cache and host A's desire to keep things (like
    > inode changes, block maps, etc) in cache and not write them to disk.
    > FreeBSD does not currently have a multi-system cache coherency protocol
    > to distribute that information to other hosts. This is something I
    > think would be very useful for many people. I suppose you could just
    > mount the filesystem when you know a change has happened, but you still
    > may not see the change. Maybe mounting the filesystem on host A with
    > the sync option would help.
    >
    >> My other concern is about susceptibility to DoS-type attacks: if one
    >> frontend were to go haywire and start hammering the NFS servers really
    >> hard, it could impact on all the other machines in the cluster.
    >>
    >> However, the problems of data synchronisation are solved: any change
    >> made on the NFS server is visible identically to all front-ends, and
    >> sysadmins can't make changes on the front-ends because the NFS export is
    >> read-only.
    >
    >
    > This was my first thought too, and a highly available NFS server is
    > something any NFS-heavy installation wants (needs). There are a few
    > implementations of clustered filesystems out there, but none for FreeBSD
    > (yet). What they allow is multiple machines talking to shared
    > storage with read/write access. Very handy, but since you only need
    > read-only access, I think your problem is much simpler, and you can get
    > away with a lot less.
    >
    >
    >> 3. Use a network distributed filesystem - CODA? AFS?
    >> ----------------------------------------------------
    >>
    >> If each frontend were to access the filesystem as a read-only network
    >> mount, but have a local copy to work with in the case of disconnected
    >> operation, then the SPOF of an NFS server would be eliminated.
    >>
    >> However, I have no experience with CODA, and although it's been in the
    >> tree since 2002, the READMEs don't inspire confidence:
    >>
    >> "It is mostly working, but hasn't been run long enough to be sure
    >> all the bugs are sorted out. ... This code is not SMP ready"
    >>
    >> Also, a local cache is no good if the data you want during disconnected
    >> operation is not in the cache at that time, which I think means this
    >> idea is not actually a very good one.
    >
    >
    > There is also a port for coda. I've been reading about this, and it's
    > an interesting filesystem, but I'm just not sure of its usefulness yet.
    >
    >
    >> 4. Mount filesystems read-only
    >> ------------------------------
    >>
    >> On each front-end I could store /webroot/cgi on a filesystem mounted
    >> read-only to prevent tampering (as long as the sysadmin doesn't
    >> remount it read-write of course). That would work reasonably well,
    >> except that being mounted read-only I couldn't use rsync to update it!
    >>
    >> It might also work with geom_mirror and ggated/ggatec, except for the
    >> issue I raised before about changing blocks on a filesystem under the
    >> nose of a client who is actively reading from it.
    >
    >
    > I suppose you could mount r/w only when doing the rsync, then switch
    > back to ro once complete. You should be able to do this online, without
    > any issues or taking the filesystem offline.
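The remount-around-the-update idea could be wrapped up roughly as follows. This is a sketch in Python under a few assumptions: that the platform's mount(8) accepts "mount -u -o rw" and "mount -u -o ro" to update an already-mounted filesystem (as FreeBSD's does), that the script runs as root, and that the helper name writable is made up for illustration. The run parameter exists only so the command sequence can be checked without touching a real mount.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def writable(mount_point, run=subprocess.check_call):
    """Remount mount_point read-write, then back to read-only on exit."""
    run(["mount", "-u", "-o", "rw", mount_point])
    try:
        yield
    finally:
        # Runs even if the update raises, so the tree is not left writable.
        run(["mount", "-u", "-o", "ro", mount_point])


# Intended use (the source path /mnt/master/cgi/ is a placeholder):
#
#     with writable("/webroot/cgi"):
#         subprocess.check_call(
#             ["rsync", "-a", "/mnt/master/cgi/", "/webroot/cgi/"])
```

Note that the switch back to read-only may be refused while some process still has a file on that filesystem open for writing, so the final remount is worth checking rather than assuming.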
    >
    >
    >> 5. Using a filesystem which really is read-only
    >> -----------------------------------------------
    >>
    >> Better tamper-protection could be had by keeping data in a filesystem
    >> structure which doesn't support any updates at all - such as cd9660 or
    >> geom_uzip.
    >>
    >> The issue here is how to roll out a new version of the data. I could push
    >> out a new filesystem image into a second partition, but it would then be
    >> necessary to unmount the old filesystem and remount the new on the same
    >> place, and you can't really unmount a filesystem which is in use. So this
    >> would require a reboot.
    >>
    >> I was thinking that some symlink trickery might help:
    >>
    >> /webroot/cgi -> /webroot/cgi1
    >> /webroot/cgi1 # filesystem A mounted here
    >> /webroot/cgi2 # filesystem B mounted here
    >>
    >> It should be possible to unmount /webroot/cgi2, dd in a new image,
    >> remount it, and change the symlink to point to /webroot/cgi2. After a
    >> little while, hopefully all the applications will stop using files in
    >> /webroot/cgi1, so this one can be unmounted and a new one put in its
    >> place on the next update. However this is not guaranteed, especially if
    >> there are long-lived processes using binary images in this partition.
    >> You'd still have to stop and restart all those processes.
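For the "change the symlink" step, removing the old link and then creating the new one leaves a short window in which /webroot/cgi does not exist. A common way around that, sketched here in Python, is to create the new link under a temporary name and rename it over the old one. The function name and the ".new" suffix are illustrative choices, not anything from the thread.

```python
import os


def switch_symlink(link_path, new_target):
    """Repoint the symlink at link_path to new_target in a single rename."""
    tmp_path = link_path + ".new"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # leftover from an interrupted earlier run
    os.symlink(new_target, tmp_path)
    # rename(2) replaces the old link itself (it does not follow it), so a
    # lookup of link_path resolves to either the old target or the new one
    # and never finds the name missing.
    os.replace(tmp_path, link_path)
```

Calling switch_symlink("/webroot/cgi", "/webroot/cgi2") would then flip the live tree. Processes that already hold files open under /webroot/cgi1 keep using them, which is the long-lived-process caveat described above.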
    >>
    >> If reboots were acceptable, then the filesystem image could also be
    >> stored in ramdisk pulled in via pxeboot. This makes sense especially for
    >> geom_uzip where the data is pre-compressed. However I would still prefer
    >> to avoid frequent reboots if at all possible. Also, whilst a ramdisk
    >> might be OK for the root filesystem, a typical CGI environment (with
    >> perl, php, ruby, python, and loads of libraries) would probably be too
    >> large anyway.
    >>
    >>
    >> 6. Journaling filesystem replication
    >> ------------------------------------
    >>
    >> If the data were stored on a journaling filesystem on the master box, and
    >> the journal logs were distributed out to the slaves, then they would all
    >> have identical filesystem copies and only a minimal amount of data would
    >> need to be pushed out to each machine on each change. (This would be
    >> rather like NetApps and their snap-mirroring system). However I'm not
    >> aware of any journaling filesystem for FreeBSD, let alone whether it
    >> would support filesystem replication in this way.
    >
    >
    > There is a project underway for UFSJ (UFS journaling). Maybe once it
    > is complete, and bugs are ironed out, one could implement a journal
    > distribution piece to send the journal updates to multiple hosts and
    > achieve what you are thinking. However, that would only distribute the
    > meta-data, and not the actual data.
    >
    >
    Have a look at DragonFly BSD for this. They are working on a journaling
    filesystem that will do just that.

    ~ Fil

    > Good luck finding your ultimate solution!
    >
    > Eric
    >
    >

    _______________________________________________
    freebsd-isp@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-isp
    To unsubscribe, send any mail to "freebsd-isp-unsubscribe@freebsd.org"

