RE: Gvinum RAID5 performance

freebsd_at_newmillennium.net.au
Date: 11/07/04

    To: "'Greg 'groggy' Lehey'" <grog@FreeBSD.org>, "'Lukas Ertl'" <le@FreeBSD.org>
    Date: Sun, 7 Nov 2004 12:06:26 +1100
    
    

    > -----Original Message-----
    > From: Greg 'groggy' Lehey [mailto:grog@FreeBSD.org]
    > Sent: Sunday, 7 November 2004 10:23 AM
    > To: Lukas Ertl
    > Cc: freebsd@newmillennium.net.au; freebsd-current@FreeBSD.org
    > Subject: Re: Gvinum RAID5 performance
    >
    > 1. Too small a stripe size. If you (our anonymous user, who was
    > using a single dd process) have to perform multiple transfers for
    > a single request, the results will be slower.

    I'm using the recommended 279 kB from the man page.

    > 2. There may be some overhead in GEOM that slows things down. If
    > this is the case, something should be done about it.

    (Disclaimer: I have only looked at the code, not put in any debugging to
    verify the situation. Also, my understanding is that the term "stripe"
    refers to the data in a plex which when read sequentially results in all
    disks being accessed exactly once, i.e. "A(n) B(n) C(n) P(n)" rather
    than blocks from a single subdisk, i.e. "A(n)", where (n) represents a
    group of contiguous blocks. Please correct me if I am wrong.)

    I can see a potential place for slowdown here...

    In geom_vinum_plex.c, line 575:

    /*
     * RAID5 sub-requests need to come in correct order, otherwise
     * we trip over the parity, as it might be overwritten by
     * another sub-request.
     */
    if (pbp->bio_driver1 != NULL &&
        gv_stripe_active(p, pbp)) {
            /* Park the bio on the waiting queue. */
            pbp->bio_cflags |= GV_BIO_ONHOLD;
            bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
            bq->bp = pbp;
            mtx_lock(&p->bqueue_mtx);
            TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
            mtx_unlock(&p->bqueue_mtx);
    }

    It seems we are holding back all requests to a currently active stripe,
    even if it is just a read and would never write anything back. I think
    the following conditions should apply:

    - If the current transactions on the stripe are reads, and we want to
      issue another read, let it through.
    - If the current transactions on the stripe are reads, and we want to
      issue a write, queue it.
    - If the current transactions on the stripe are writes, and we want to
      issue another write, queue it (but see below).
    - If the current transactions on the stripe are writes, and we want to
      issue a read, queue it if it overlaps the data being written, or if
      the plex is degraded and the request requires the parity to be read;
      otherwise, let it through.
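
    The four rules above can be sketched as a single decision function.
    This is only an illustration of the proposed policy, not code from
    geom_vinum: the types stripe_state and io_req, the function
    must_queue(), and the needs_parity flag are all hypothetical names,
    and the sketch assumes that a single block range is enough to
    describe the writes in flight on a stripe. It also assumes that an
    idle stripe lets everything through, as the existing check does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; none of these exist in geom_vinum. */
enum io_kind { IO_READ, IO_WRITE };

struct stripe_state {
	size_t active_reads;   /* reads currently in flight on the stripe */
	size_t active_writes;  /* writes currently in flight on the stripe */
	long   write_start;    /* first block being written (if active_writes > 0) */
	long   write_end;      /* one past the last block being written */
	bool   degraded;       /* the plex is missing a subdisk */
};

struct io_req {
	enum io_kind kind;
	long start;            /* first block of the request */
	long end;              /* one past the last block of the request */
	bool needs_parity;     /* read must reconstruct data from parity */
};

/* Two half-open block ranges [a0, a1) and [b0, b1) overlap. */
static bool
ranges_overlap(long a0, long a1, long b0, long b1)
{
	return (a0 < b1 && b0 < a1);
}

/* Return true if the request has to be parked on the waiting queue. */
static bool
must_queue(const struct stripe_state *s, const struct io_req *r)
{
	assert(r->start < r->end);

	if (s->active_writes == 0) {
		/* Assumption: an idle stripe never delays a request. */
		if (s->active_reads == 0)
			return (false);
		/* Only reads are active: reads pass, writes wait. */
		return (r->kind == IO_WRITE);
	}

	/* Writes are active: another write always waits. */
	if (r->kind == IO_WRITE)
		return (true);

	/* A read waits if it touches the blocks being written ... */
	if (ranges_overlap(r->start, r->end, s->write_start, s->write_end))
		return (true);

	/* ... or if it needs the parity while the plex is degraded. */
	return (s->degraded && r->needs_parity);
}
```

    With one write in flight on blocks 0-7, a read of blocks 8-15 would
    pass immediately on a healthy plex, while a read of blocks 0-7 or
    any second write would be queued.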

    We could also optimize writing a bit by doing the following:

    1. To calculate parity, we could simply read the old data (that is
    about to be overwritten) and the old parity, and recalculate the parity
    from those two blocks and the new data, rather than reading in every
    other block in the stripe (this assumes that the original parity was
    correct). This would still take approximately the same amount of time,
    but would leave the other disks in the stripe available for other I/O.

    2. If there are two or more writes pending for the same stripe (that is,
    up to the point that the data|parity has been written), they should be
    condensed into a single operation so that there is a single write to the
    parity, rather than one write for each operation. This way, we should be
    able to get close to (N - 1) * disk throughput for large sequential
    writes, without compromising the integrity of the parity on disk.

    3. When calculating parity as per (2), we should operate on whole blocks
    (as defined by the underlying device). This provides the benefit of
    being able to write a complete block to the subdisk, so the underlying
    mechanism does not have to do a read/update/write operation to write a
    partial block.
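
    To make the three suggestions concrete, here is a small in-memory
    sketch. It relies on the XOR identity
    new_parity = old_parity ^ old_data ^ new_data, folds several pending
    writes into one parity update, and works on whole blocks only.
    Everything in it is hypothetical (the struct and function names, the
    3+1 layout, and the tiny 8-byte block size); it models the idea with
    plain arrays and does not touch gvinum or any real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK   8   /* hypothetical block size in bytes, tiny for illustration */
#define NDATA 3   /* hypothetical number of data subdisks per stripe */

struct stripe {
	uint8_t  data[NDATA][BLK];  /* one block per data subdisk */
	uint8_t  parity[BLK];       /* XOR of all data blocks */
	unsigned parity_writes;     /* simulated writes to the parity subdisk */
};

struct pending_write {
	size_t  subdisk;            /* which data subdisk is overwritten */
	uint8_t new_data[BLK];      /* a complete block of new data */
};

/* Round a byte offset down or up to a block boundary (suggestion 3). */
static size_t block_floor(size_t off, size_t blk) { return (off - off % blk); }
static size_t block_ceil(size_t off, size_t blk)  { return ((off + blk - 1) / blk * blk); }

/* Suggestion 1: parity ^= old ^ new, without reading the other subdisks. */
static void
parity_apply_delta(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}

/* Suggestion 2: apply n pending writes, then write the parity once. */
static void
stripe_flush(struct stripe *s, const struct pending_write *w, size_t n)
{
	uint8_t parity[BLK];

	memcpy(parity, s->parity, BLK);            /* one read of old parity */
	for (size_t k = 0; k < n; k++) {
		assert(w[k].subdisk < NDATA);
		parity_apply_delta(parity, s->data[w[k].subdisk],
		    w[k].new_data, BLK);
		memcpy(s->data[w[k].subdisk], w[k].new_data, BLK);
	}
	memcpy(s->parity, parity, BLK);            /* single parity write */
	s->parity_writes++;
}

/* Check that the stored parity equals the XOR of all data blocks. */
static int
parity_ok(const struct stripe *s)
{
	for (size_t i = 0; i < BLK; i++) {
		uint8_t x = 0;
		for (size_t d = 0; d < NDATA; d++)
			x ^= s->data[d][i];
		if (x != s->parity[i])
			return (0);
	}
	return (1);
}
```

    Flushing two pending writes through stripe_flush() leaves the parity
    consistent while incrementing parity_writes only once. Because each
    write's data block is updated before the next delta is applied, two
    pending writes to the same subdisk also end up with correct parity.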

    Comments?

    -- 
    Alastair D'Silva           mob: 0413 485 733
    Networking Consultant      fax: 0413 181 661
    New Millennium Networking  web: http://www.newmillennium.net.au
    
