Re: Some initial postmark numbers from a dual-PIII+ATA, 4.x and 6.x

From: Robert Watson (rwatson_at_FreeBSD.org)
Date: 02/06/05

  • Next message: Andrey Smagin: "My notices about ATA performance"
    Date: Sun, 6 Feb 2005 14:43:46 +0000 (GMT)
    To: Jeremie Le Hen <jeremie@le-hen.org>
    
    

    On Sun, 6 Feb 2005, Jeremie Le Hen wrote:

    > Hi Robert,
    >
    > > This would seem to place it closer to 4.x than 5.x -- possibly a property
    > > of a lack of preemption. Again, the differences here are so small it's a
    > > bit difficult to reason using them.
    >
    > Thanks for the result. I'm quite dubitative now : I thought this was a
    > fact that RELENG_5 have worse performances than RELENG_4 for the moment,
    > partly due to lack of micro-optimizations. There have been indeed
    > numerous reports about weak performances on 5.x. Seeing your results,
    > it appears that RELENG_4, RELENG_5 and CURRENT are in fact very close.
    > What should we think then ?

    You should think that benchmark results are a property of several factors:

    - Work load
    - Software baseline
    - Hardware configuration
    - Software configuration
    - Experimental method
    - Effectiveness of documentation

    Let's evaluate each:

    - The workload was postmark in a relatively stock configuration. I
      selected a smaller number of transactions than some other reporters,
      based on the fact that my hardware is quite a bit slower and I wanted to
      try and get coverage of a number of versions. I selected a 90-ish
      second run. The postmark benchmark is basically about effective
      caching, efficient I/O processing, and how the file system manages
      meta-data.

    - Software baseline: I chose to run with 4.x, 5.x, and 6.x kernels, all
      configured for "production" use, i.e., with no debugging features enabled.
      I also used a statically compiled 4.x postmark binary for all tests on
      any versions, to try and avoid the effects of compiler changes, etc. I
      was primarily interested in evaluating the performance of the kernel as
      a variable.

    - Hardware configuration: I'm using somewhat dated PIII MP hardware with a
      relatively weak I/O path. It was the hardware on-hand and easily
      preemptible. The hardware has pretty good CPU:I/O performance, meaning
      that with many interesting workloads, the work will be I/O-bound, not
      CPU-bound. It becomes a question of feeding the CPUs and keeping the
      available I/O path used effectively.

    - Software configuration: I network booted the kernel, and used one of two
      user spaces on disk -- a 4.x world and a 6.x world. However, I used a
      single shared UFS1 partition for the postmark target. My hope was that
      static linking would eliminate issues involving library changes, and
      that using the same file system partition would help reduce disk
      location effects (note that disk performance varies substantially based
      on the location of data on the platter -- if you lay out a disk into
      several partitions, they will have quite different performance
      properties -- often in excess of the measurable experimental results of
      the property you're testing for). However, as a result I used UFS1 for
      both tests, which is not the default install configuration for FreeBSD
      5.x and 6.x.

    - Experimental method: I attempted to control additional variables as
      much as possible. However, I used a small number of runs per
      configuration: two. I selected that number to illustrate whether there
      were caching effects in play between multiple runs without reboots. The
      numbers suggest slight caching effects, but not huge ones. The numbers
      weren't large enough to give a sampling distribution that could be
      analyzed -- on the other hand, they were relatively long runs resulting
      in "mean results", meaning that we benefited from a sampling effect and
      a smoothing effect by virtue of the experiment design. To run this
      experiment properly, you'd want to distinguish the caching/non-caching
      cases better, control the time between runs better, and have larger
      samples. In order to try to explain the results I got, I waved my hands
      at CPU cost, and will go into that some more below. I did not test the
      CPU load during the experiment in a rigorous or reproducible way.

    - Effectiveness of documentation: my experiment was documented, although
      not in great detail. I neglected to document the version of postmark
      (1.5c), the partition layout details, and the complete configuration
      details. I've included more here.
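
    To make the sample-size point under "Experimental method" concrete,
    here is a minimal sketch of how repeated runs of one configuration
    could be summarized and two configurations compared. It is not what I
    did for the posted numbers. The function names and the two-standard-error
    threshold are assumptions chosen for illustration, and the comparison is
    a crude rule of thumb rather than a proper statistical test.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, standard error) for one configuration."""
    n = len(samples)
    if n < 2:
        # With a single run there is no way to estimate run-to-run variance.
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean, stdev, stdev / math.sqrt(n)


def clearly_different(a, b, k=2.0):
    """Crude check: do the two means differ by more than k combined standard errors?

    The threshold k is an assumption for illustration, not a formal test.
    """
    mean_a, _, se_a = summarize(a)
    mean_b, _, se_b = summarize(b)
    return abs(mean_a - mean_b) > k * math.sqrt(se_a ** 2 + se_b ** 2)
```

    Feeding it the per-run transactions/second figures for two kernels gives
    a rough idea of whether a difference is larger than the run-to-run
    noise. With only two runs per configuration, as in my tests, the
    variance estimate itself is very weak, which is the reason to want
    larger samples.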

    In my original results post, I demonstrated that, subject to the
    conditions of the tests (documented above and previously), FreeBSD 5.x/6.x
    performance was in line with 4.x performance, or perhaps marginally
    faster. This surprised me also: I expected to see a 5%-10% performance
    drop on UP based on increased overhead, and hoped for a moderate
    measurable SMP performance gain relative to 4.x. On getting the results I
    did, I reran a couple of sample cases -- specifically, 4.x and 6.x kernels
    on SMP with some informal measurement of system time. I concluded that
    the systems were basically idle throughout the tests, which was a likely
    result of the I/O path being the performance bottleneck. It's likely that
    the slight performance improvement between 4.x and 6.x relates to
    preemption and the ability to turn around I/Os in the ATA driver faster,
    or maybe some minor pipelining effect in GEOM or such. It would be
    interesting to know what it is that makes 6.x faster, but it may be hard
    to find out given the amount of change in the system.

    I also informally concluded that 6.x was seeing a higher percentage system
    time than 4.x. This result needs to be investigated properly in an
    experiment of its own, since it was based on informal watching of %system
    in systat, combined with a subjective observation that the numbers
    appeared bigger. An experiment involving the use of time(1) would be a
    good place to start. What's interesting about this informal observation
    (not a formal experimental conclusion!) is that it might explain the
    differing postmark results from some of the other reporters. The system I
    tested on has decent CPU oomph, but relatively slow ATA drive
    technology -- not a RAID, not UDMA100, etc. So if a bit more CPU was
    burned to get slightly more efficient use of the I/O channel, then that
    was immediately visible as a positive factor. On systems with much
    stronger I/O capabilities, perhaps to the point of being CPU-bound, that
    can hurt rather than help, as there are fewer resources available to
    support the critical path.
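
    As a starting point for the time(1)-style experiment mentioned above,
    the following is a rough sketch that runs a command and reports
    wall-clock, user, and system time for it. The function name
    time_command and the returned dictionary layout are assumptions made
    for illustration. It only reports CPU time charged to the child
    process, so it may not capture all kernel work done on behalf of the
    I/O (interrupt handling, for example), and it relies on the Unix-only
    resource module.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return wall, user, and system seconds.

    CPU times come from getrusage(RUSAGE_CHILDREN), which accumulates the
    usage of child processes that have been waited for, so we take the
    difference before and after the run.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=True)  # waits for the child to exit
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "wall": wall,
        "user": after.ru_utime - before.ru_utime,
        "system": after.ru_stime - before.ru_stime,
    }
```

    Calling time_command() with the benchmark invocation once per kernel,
    and comparing the "system" figure against "wall", would put a number on
    the %system impression instead of relying on watching systat.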

    Another point that may have helped my configuration is that it ran on a
    PIII, where the relative costs of synchronization primitives are much
    lower. A few months ago, I ran a set of micro-benchmarks that
    illustrated that on the P4 architecture, synchronization primitives are
    many times more expensive relative to regular operations than on
    previous architectures. It could be that the instruction blend came out
    "net worse" in the 5.x/6.x systems on P4-based hardware.

    Another point in favor of the configuration I was running was that
    the ATA driver is MPSAFE. This means its interrupt handler is able to
    preempt most running code, and that it can execute effectively in parallel
    against other parts of the kernel (including the file system). Several of
    the reported results were on the twe storage adapter, which does not have
    that property. Last night, Scott Long mailed me patches to fix dumping on
    twe, and also make it MPSAFE. I hope to run some stability testing on
    that, and then hopefully we can get those patches into the hands of people
    doing performance testing with twe and see if they help. FWIW, similar
    changes on amr and ips have resulted in substantial I/O improvements,
    primarily by increasing transactions-per-second throughput through
    reduced latency in processing the I/O transactions. It's easy to
    imagine this having a direct effect on a benchmark that is really a
    measure of meta-data transaction throughput.

    Finally, my slightly hazy recollection of earlier posts was that postmark
    generally illustrated somewhat consistent performance between FreeBSD
    revisions (excepting NFS async breakage), but that Linux seemed to tromp
    all over on meta-data operations. There was some hypothesizing by Matt
    and Poul-Henning that this was a result of having what Poul-Henning refers
    to as a "Lemming Syncer" -- i.e., a design issue in the way we stream data
    to disk.

    Robert N M Watson

    _______________________________________________
    freebsd-performance@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"

