Re: ZFS: filesystem approach
From: Logan Shaw (lshaw-usenet_at_austin.rr.com)
Date: Sat, 19 Nov 2005 20:17:12 GMT
Joerg Schilling wrote:
> In article <email@example.com>,
> Dan Foster <firstname.lastname@example.org> wrote:
>>The suggested approach is one filesystem per user.
>>While that's great on the resource control angle (e.g. quotas), I'm
>>wondering if Sun has tested ZFS on systems with 1/2 million users, each
>>with their own ZFS filesystem, to see how usable it would actually be?
> There are other things that need to be proved in practical tests:
> - What about backups?
> Will there be a super filesystem mount that could be used
> as the source to be backed up?
With zfs, it looks like everything within one pool is
mounted hierarchically, following its name within the pool,
unless you override that by setting the "mountpoint"
property for an individual filesystem (in which case the
override applies to that filesystem and everything below it
in the zfs pool's hierarchy).
At http://www.opensolaris.org/os/community/zfs/demos/ , the
first demo shows an example of creating a "home" filesystem
within a zfs pool, then putting separate filesystems for each
user within that filesystem. That is, they create the
mypool/home filesystem, and then within that they create
mypool/home/ann, mypool/home/bob, and mypool/home/carol.
Then they do "zfs set mountpoint=/users mypool/home", and
this moves everything so that you have /users/ann and so on.
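For reference, the sequence of commands in that demo looks
roughly like this (pool and user names are the demo's; I am
reconstructing the exact invocations from the description):

    zfs create mypool/home
    zfs create mypool/home/ann
    zfs create mypool/home/bob
    zfs create mypool/home/carol
    zfs set mountpoint=/users mypool/home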
The point is, if you follow this strategy, then within the
vfs namespace you will have all home directories under a
single node, so that if you wanted to use a traditional
backup tool to back them all up in one archive, you could
do so.
Having said that, zfs supports instantaneous snapshots with
virtually no overhead, and the snapshots appear in a special
directory within the filesystem whose snapshot has been taken.
This could be a very helpful thing for taking backups, since
it will tend to be closer to consistent than a backup taken
over the course of several minutes or hours. (Indeed, you may
be able to take snapshots in single-user mode and then boot up
and do a backup later, in which case they'd be perfectly
consistent.)
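For instance, taking a snapshot and then looking at it
through that special directory might go something like this
(the snapshot name is mine; the .zfs/snapshot location is
what the demos show):

    zfs snapshot mypool/home/ann@monday
    ls /users/ann/.zfs/snapshot/monday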
But the point is that since each snapshot lives within a
particular subdirectory of its own filesystem, it is
non-trivial to back up several filesystems together with a
traditional tool like tar (or some third-party variant of
tar). Instead of this:
    tar cf $TAPE .
it's going to be something more like this:
    # snapshot each user's filesystem
    for u in ann bob carol; do
        zfs snapshot mypool/home/"$u"@backup
    done
    # archive the snapshot directories
    for u in ann bob carol; do
        echo /users/"$u"/.zfs/snapshot/backup
    done | xargs tar cf $TAPE
    # clean up the snapshots
    for u in ann bob carol; do
        zfs destroy mypool/home/"$u"@backup
    done
However, even that will put ugly paths like
    users/ann/.zfs/snapshot/backup/...
into the archive, which isn't the greatest thing, especially
if you are trying to restore a full system. It should be
possible with some tools to reparent all that to eliminate
the .zfs/snapshot/backup part of the path (preferably when
creating the archive, but later if necessary), but it'll be
ugly.
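One tool that might manage it at archive-creation time is
pax(1), whose -s option rewrites pathnames as members are
written; a sketch, assuming the /users layout above:

    cd /users
    # pax -w reads pathnames from stdin when no file operands
    # are given; the -s substitution strips the snapshot
    # component, so members are archived as ann/..., bob/...
    for u in ann bob carol; do
        echo "$u"/.zfs/snapshot/backup
    done | pax -w -f "$TAPE" -s ',/\.zfs/snapshot/backup,,'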
On the subject of backups and zfs and snapshots, I noticed
one interesting thing about the built-in zfs backup functionality.
It seems that incremental backups are handled by taking two
snapshots. You take a snapshot when you do the full backup,
and then you take another when it's time to do the incremental.
Then, in effect, the backup tool takes a "diff" between the
two snapshots and decides what to back up that way. Surely
they have built-in stuff to make this fast, but there is one
potential gotcha: it looks like you'll have to keep the full
snapshot around to even be able to do an incremental. That
could be a problem since deleted files' storage does not become
free as long as they are present in a snapshot, which means
that if you are running low on disk space, there may be
situations where you have to sacrifice the ability to do
incremental backups in order to get the disk usage down!
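For concreteness, my reading of how that scheme would look
in practice (the zfs backup syntax here is a sketch, and the
snapshot names are invented):

    # full backup: snapshot, then dump the whole snapshot
    zfs snapshot mypool/home/ann@full
    zfs backup mypool/home/ann@full > /backup/ann.0
    # later, an incremental: take a second snapshot and dump
    # the "diff"; note that @full must still exist
    zfs snapshot mypool/home/ann@incr
    zfs backup -i mypool/home/ann@full mypool/home/ann@incr > /backup/ann.1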
Still, it's not too terrible a problem since (a) disk is
cheap these days, and (b) if you are deleting a bunch of
files and then filling up the disk again, that means it is
time to do a full backup soon anyway. But still, I predict
people will find it annoying when this forces their hand
and they find that a full backup has become an urgent thing
in the middle of a disk-space crunch.
As I understand it, the present ufsdump system avoids this
problem altogether by identifying which files are to go
into an incremental through the use of timestamps. That
presents another problem: if you only back up the files
which have been created or changed since the last backup,
you lose information about files which have been deleted
since then. That problem is easily solved by including a
list of all files that are present on the filesystem in
the incremental dump, though: this would allow the restore
tool to preserve the deletes. I could be wrong, but I
think that is how ufsdump/ufsrestore handle it. Anyway,
the point is that this "old" style of backups is
less efficient at creating the incremental (the incremental
requires more space and time to create), but it is more
efficient on the live filesystem because it doesn't require
you to keep disk-wasting snapshots around. So, it would
seem that, if I
understand them both correctly, the new zfs backup method
is not the best of both worlds, and in fact when switching
from ufsdump to "zfs backup", you are gaining some things
and losing others.
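To make the timestamp idea concrete, here is a toy sketch of
the general approach (this is an illustration, not ufsdump's
actual mechanism or tape format; paths and names are
invented):

    # archive files changed since the last run, plus a
    # manifest of everything currently present so a restore
    # tool could replay deletions
    STAMP=/var/tmp/last-backup.stamp   # left by the previous run
    find /users -newer "$STAMP" -print > /tmp/changed.list
    find /users -print > /tmp/manifest.list
    echo /tmp/manifest.list >> /tmp/changed.list
    pax -w -f "$TAPE" < /tmp/changed.list
    touch "$STAMP"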
> - How will inode numbers be?
> If there is a super root, then the root dir inode # of each FS
> will certainly not be 2.
It seems like it could be either way. Certainly for each
zfs pool there must be a root. But the ZFS presentation
makes it look (especially on page 12) like each filesystem
within a pool has its own "uber block". For example, a
snapshot (or a clone) is created by starting a new "uber
block" that points to the same data.
Also, if zfs is truly without arbitrary limitations, it seems
like addresses of i-nodes (or whatever the equivalent is) would
need to be larger than 64 bits. That means they're larger than
an ino_t even on 64-bit systems, which would further imply
that i-node numbers may be generated synthetically. In which
case, it is no problem to make sure the root i-node for a
zfs filesystem is #2. :-)
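For what it's worth, a quick way to see what number a
filesystem's root directory actually gets (layout from the
demo above) is:

    ls -di /users/ann

which prints the directory's i-node number before its name.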
However, I do agree that tools that make stupid, undocumented
assumptions about i-node numbers are dumb and deserve to fail. :-)