Re: My experiences with gvirstor



Patrick Tracanelli wrote:
Here is what I got as my experiences with gvirstor so far.

With kern.geom.virstor.debug=15, things get really slow. While accessing
(newfs'ing) /dev/virstor/home, the system starts to use 98% of CPU
cycles. No problem after all, just mentioning it in case this shouldn't
happen.

I don't see how to avoid it, since this mode generates a real mountain
of messages :)

..
3542977888, 3543354240, 3543730592, 3544106944, 3544483296, 3544859648,
3545236000, 3545612352, 3545988704, 3546365056,
3546741408, 3547117760, 3547494112,

and it STOPS. Checking the debug output, I can see that BIO delaying is
working, because I get:

GEOM_VIRSTOR[1]: All physical space allocated for home
GEOM_VIRSTOR[2]: Delaying BIO (size=65536) until free physical space can
be found on virstor/home

Ok, this is because UFS creates cylinder groups across the drive, and
though they are basically small, each of them allocates an entire 4 MB
chunk. Thus the problems.

If I ./gvirstor add home ad4s1, things start to work again. But ad4s1 is
way too small and is not enough. But at least I can create a 1 TB device:

/dev/virstor/home 946G 4.0K 870G 0% /usr/home4

Nice.

So my question: how can I do the math to find out how much real space
I will need to create a gvirstor device of size N?

# ./gvirstor status home
Name Status Components
virstor/home 43% physical free ad2s1

Since it is a 40GB device, something close to 34GB was used to store
the structure of a 1TB device. Is this usage related to the chunk size?

The idea to follow here is that (chunk_size * number_of_cgs) is the
smallest physical space required for newfs to finish. To give you a
precise answer, I'll have to find out the formula by which newfs
calculates how many cylinder groups it wants to create.
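In the meantime, here is a rough Python sketch of that rule. The function
name is made up for illustration, and the cylinder group size of about
200 MB is only an assumed typical value; newfs picks the real one, so
treat the result as an order-of-magnitude estimate, not a precise answer.
It also assumes each cylinder group touches exactly one chunk, which only
makes sense while the chunk is smaller than a cylinder group.

```python
import math

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def min_physical_bytes(virtual_bytes, chunk_bytes=4 * MB, cg_bytes=200 * MB):
    """Estimate chunk_size * number_of_cgs for a device of virtual_bytes.

    cg_bytes is an ASSUMED average cylinder group size; the true number of
    cylinder groups is decided by newfs and is not computed here.
    """
    number_of_cgs = math.ceil(virtual_bytes / cg_bytes)
    return number_of_cgs * chunk_bytes


one_tb = 1024 * GB
for chunk_kb in (4096, 512):
    need = min_physical_bytes(one_tb, chunk_bytes=chunk_kb * KB)
    print(f"chunk {chunk_kb} KB -> roughly {need / GB:.1f} GB of real space")
```

Under these assumptions the required space scales linearly with the chunk
size, which is why a smaller chunk size is worth trying.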

It would be very interesting if you choose a different chunk size than
the default 4 MB. For example, try using chunks of 512 KB (gvirstor
create -m 512 ...).

Also, if you're in the mood for it, benchmarking both chunk sizes
(newfs time as well as bonnie++) would be interesting.

However, if I export the gvirstor device, the other side (ggate client)
can only import it if it is unmounted on the local machine (the one
where gvirstor resides):

/dev/ggate0 946G 4.0K 870G 0% /mnt

If I try to mount it, I get:

# mount /dev/virstor/home /usr/home4
mount: /dev/virstor/home: Operation not permitted

This is the limitation of the UFS file system, not GEOM & its classes.
You can search the archives for many lamentations about how people are
missing a real distributed & concurrent file system in FreeBSD.

That's bad fun :( I thought I could do more lego play. This seems like
the same problem I had in the past, trying to export a mounted gmirror
device.

Yes, it's the same problem.

iostat -w1 ad0 ad2

I can see there is no performance difference between writes to the ad
provider and writes to a gvirstor provider. I can also see that only one
provider is in use at a time. I only get activity on ad0 when ad2 has
run out of space. gstat shows me the same thing.

Yes, it will fill up the virstor device one drive at a time, in the
order in which they have been added. If you want multiple devices to be
used at the same time, you'll have to add a gstripe "lego brick" in the
setup :)
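To picture the fill order, here is a toy Python sketch. It is not the
virstor code; the function name and the tiny capacities (counted in
chunks) are invented for illustration. Components are consumed strictly
in the order they were added, so the second one sees no writes until the
first is full.

```python
def place_chunks(components, n_chunks):
    """Return the component name that backs each newly allocated chunk.

    components: list of (name, capacity_in_chunks), in the order added.
    Hypothetical helper; real virstor tracks this in its own metadata.
    """
    placement = []
    for name, capacity in components:
        # Take as much as this component can hold before moving on.
        take = min(capacity, n_chunks - len(placement))
        placement.extend([name] * take)
    return placement


# ad2 was added first, so it fills up before ad0 is touched.
placement = place_chunks([("ad2", 3), ("ad0", 2)], 4)
print(placement)
```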

However, let me ask something. Is metadata information updated
synchronously?

Yes for the virstor. Virstor metadata needs only to be updated when a
new physical block is allocated.

For example, let's assume a virstor device that has 5 chunks of 1 MB:

[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]

So, the above line represents 5 MB of virtual storage. Let's say an
application (and, for this argument, this includes the file system),
writes one byte to the position "2 MB". Now the second chunk gets its
physical backing and virstor metadata is written to reflect this. The
new situation is:

[ 1 ] [.2.] [ 3 ] [ 4 ] [ 5 ]

(the dots represent a virtual chunk with physical backing). Now, when
the application writes a single byte at position "2 MB + 1", that byte
gets written to the same already allocated second chunk, so there's no
need to allocate another chunk, and there's no need to write virstor
metadata (AKA "the allocation table"). But when an application writes to
position "4 MB", another chunk gets physical backing:

[ 1 ] [.2.] [ 3 ] [.4.] [ 5 ]

etc., etc. The unallocated chunks remain as "holes" - reading from them
produces bytes out of thin air (AKA "the infinite supply of zeroes"),
and writing to them allocates chunks (if they aren't already allocated)
and writes to physical storage.
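The same behaviour as a minimal Python sketch. This is a toy model, not
the virstor implementation: the class and attribute names are invented,
and byte offsets are 0-based, so the second chunk covers the range from
1 MB up to (but not including) 2 MB. It counts how often the allocation
table would have to be written.

```python
CHUNK = 1024 * 1024  # 1 MB chunks, as in the example above


class ThinDevice:
    """Toy virtual device whose chunks get backing on first write."""

    def __init__(self, n_chunks):
        self.n_chunks = n_chunks
        self.backing = {}         # 0-based virtual chunk index -> bytearray
        self.metadata_writes = 0  # times the allocation table changed

    def write(self, offset, data):
        # Simplification: a write must not cross a chunk boundary.
        index, within = divmod(offset, CHUNK)
        assert index < self.n_chunks and within + len(data) <= CHUNK
        if index not in self.backing:
            # First write into a hole: allocate and record it in metadata.
            self.backing[index] = bytearray(CHUNK)
            self.metadata_writes += 1
        self.backing[index][within:within + len(data)] = data

    def read(self, offset, length):
        index, within = divmod(offset, CHUNK)
        assert index < self.n_chunks and within + length <= CHUNK
        chunk = self.backing.get(index)
        if chunk is None:
            return bytes(length)  # a hole reads back as zeroes
        return bytes(chunk[within:within + length])


dev = ThinDevice(5)
dev.write(1 * CHUNK, b"x")      # first byte of the second chunk: allocates
dev.write(1 * CHUNK + 1, b"y")  # same chunk: no metadata write
dev.write(3 * CHUNK, b"z")      # fourth chunk: allocates
print(sorted(dev.backing), dev.metadata_writes)
```

Three writes, but only two of them touch the allocation table.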

I ask because removing /usr/home4/40G.bin (rm /usr/home4/40G.bin) takes
about a minute and a half to finish (newfs was run with the -U flag).

Ok, now things get to be interesting. Let's see how the allocation goes
from the physical device side. Let's assume we have a drive that can
hold 2 MB. That is, two chunks of 1 MB:

{ 1 } { 2 }

When the first allocation in the above example happens, virtual chunk
[2] is mapped to physical chunk {1}, and now we have:

{.1.} { 2 }

When the second allocation happens, virtual chunk [4] is mapped to
physical chunk {2}:

{.1.} {.2.}

And the mapping table contains something like:

[1]->? [2]->{1} [3]->? [4]->{2}
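A small Python sketch that rebuilds this table, using the same 1-based
chunk numbers as the diagrams. The names are hypothetical and the
allocator simply hands out the next unused physical chunk. When the drive
is full, the sketch raises an exception as a simplification; the real
virstor delays the BIO until free physical space can be found.

```python
class OutOfSpace(Exception):
    """Raised by this sketch when no physical chunk is left."""


def allocate(mapping, virtual_chunk, physical_total):
    """Map virtual_chunk to the next unused physical chunk (both 1-based)."""
    if virtual_chunk in mapping:
        return mapping[virtual_chunk]  # already backed, nothing to do
    next_physical = len(mapping) + 1
    if next_physical > physical_total:
        raise OutOfSpace(f"no physical chunk left for [{virtual_chunk}]")
    mapping[virtual_chunk] = next_physical
    return next_physical


mapping = {}
allocate(mapping, 2, physical_total=2)  # [2] -> {1}
allocate(mapping, 4, physical_total=2)  # [4] -> {2}

table = " ".join(
    f"[{v}]->" + (f"{{{mapping[v]}}}" if v in mapping else "?")
    for v in range(1, 5)
)
print(table)
```

Note that neighbouring virtual chunks end up wherever the next free
physical chunk happens to be, not next to each other.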

The reason why newfs creates cylinder groups is speed. Cylinder groups
group "nearby" files, where "nearby" is determined by some heuristics,
including "belonging to the same directory".

One cylinder group is usually somewhere around 200 MB in size. This
means that it can hold 200 MB of files in a small area on the hard drive
platter, so that jumping from one file to the next involves very little
seeking. When using virstor, the space occupied by a single cylinder
group suddenly becomes scattered around the hard drive platter,
defeating the purpose of grouping and introducing many more seeks.
(This is a bit simplified, but correct in principle.)

There are three ways to "fix" this:

1. Use huge chunk sizes, like 200 MB. (but cg size also cannot be
reliably calculated in advance, and huge chunk sizes will badly
influence the "savings" in storage virstor device can provide)
2. Use a medium that doesn't have seek penalties, such as solid state
memory (flash drives)
3. Use gjournal, and set the journal on a non-virstor device. This way,
most writes (and unlink() calls have lots of writes) will go to the
journal device first. (gjournal is available only in 7-CURRENT)




