Re: FreeBSD 5.3 Bridge performance take II

From: Matthew Dillon (dillon_at_apollo.backplane.com)
Date: 09/09/04

  • Next message: Ketrien I. Saihr-Kesenchedra: "Re: Still getting Signal 6 with BETA3"
    Date: Thu, 9 Sep 2004 01:55:01 -0700 (PDT)
    To: Robert Watson <rwatson@freebsd.org>
    
    

    :% One nice thing about using this experimental code is that I hope it will
    :% allow us to reason more effectively about the extent to which improving
    :% per-cpu data structures improves efficiency -- I can now much more
    :% easily say "OK, what happens if eliminate the cost of locking for common
    :% place mbuf allocation/free". I've also started looking at per-interface
    :% caches based on the same model, which has some similar limitations (but
    :% also some similar benefits), such as stuffing per-interface uma caches
    :% in struct ifnet.
    :
    :I.e., using per-thread UMA caches is a 30-60 minute hack that allows me to
    :explore and measure the performance benefits (and costs) of several
    :different approaches, including per-cpu, per-thread, and per-data
    :structure/object caching without doing the full implementation up front.
    :Per-thread caching, for example, can simulate the effects of
    :non-preemption and mutex avoidance in micro-benchmarking, although in the
    :general case under macro-benchmark perspective it suffers from a number of
    :problems (including the draining, balancing, and extra storage cost
    :issues). I didn't attempt to address these problems under the assumption
    :that the current implementation is a tool for exploring performance, not
    :something to actually use.
    :

        Well, I see some major problems with this avenue of development.
        I see this as an end-run around existing, broken (or perceived to be
        broken) APIs.

        If you don't believe that the slab allocator has severe performance
        issues then the slab allocator (aka malloc()) should simply be used
        directly. If you do believe that the slab allocator has severe performance
        issues then the correct solution is to FIRST FIX THE SLAB ALLOCATOR.
        Until the slab allocator is fixed the system-wide overhead will skew
        the results from any other optimization tests you try to make. i.e.
        results from other optimizations may appear to be less effective simply
        due to being washed out by the slab allocator's overhead.

        In the same respect, the idea that a per-thread memory cache is going to
        be more efficient then a per-cpu memory cache implies that it is too
        expensive to implement the locking required to implement a per-cpu
        memory cache vs a per-thread memory cache. If that is the implication,
        the solution is to fix the required locking. Frankly it should be no
        more expensive then a critical section and a critical section to
        access per-cpu data should be no more then a nanosecond or two more
        expensive then access to a per-thread data structure.

        Per-thread caching APIs have major design hurdles to overcome. I've
        already listed a few of them, but there are many, many more. For example,
        locality of reference may seem to be a slam dunk but you actually get
        better locality of reference with a per-cpu cache, especially when a
        thread migrates between cpus, but also because the cache is able to
        take advantage of and reuse a very recently reused chunk of data
        that might have been freed by some other thread on the same cpu... so
        you get it even after a context switch (and you don't get that with a
        per-thread cache). This is a case that can occur quite often, especially
        when interrupt threads are shipping data to protocol threads. I could
        go on and on, but I am not going to because I think it damn well ought
        to be obvious.

        It seems clear to me that it makes little sense to spend time
        on a per-thread memory cache when a per-cpu memory cache is, just from
        an algorithmic point of view, going to be far more effective, far
        easier to manage, possible to have larger (deterministic) hysteresis
        without creating too much non-deterministic slop, and so on and so forth.

        If this is just an experiment, and therefore something that will never
        be committed, then I still don't understand why you are even wasting time
        working on it when you could be fixing the slab allocator instead.
        IMHO I do not believe that a per-thread memory cache would even come
        close to characterizing the performance benefits of a mutexless per-cpu
        cache. While the base overhead is similar, the side effects and
        management requirements are going to be very different. If the
        experiment is supposed to characterize the potential performance
        improvement over the slab allocator, then I believe it is already
        too flawed to be an accurate measure of that.

        Now does this mean that caches in front of the slab allocator are bad?
        No, it doesn't. I use front-end caches in several places in
        DragonFly and you definitely want to do the same thing in FreeBSD.
        The difference, I think, is that we only use front-end caches for
        performance-critical subsystems and when we do, we implement the
        code directly into the subsystem. We depend on our core slab allocator
        (i.e. malloc()) far more then you seem to want to depend on yours.
        FreeBSD seems to depend on its UMA type-stable zone allocator for the
        same thing but that has a lot of extra, unnecessary overhead. The
        MBUF allocator is a good example of this. You guys are making several
        extra procedural calls that we aren't in the mbuf allocation path.
        The way I see it, if it's important enough to require its own front-end
        cache over simply using the slab allocator, then you might as well go
        whole hog.

        But it is also important to make your slab allocator fast so there
        are fewer situations where you feel that you need to bypass it. A
        front-end cache is not supposed to be a workaround for a slow
        slab allocator but instead is supposed to be an ultra-optimized
        implementation to reduce overhead in a critical subsystem.

    :In doing so, my hope was to identify which areas will offer the most
    :immediate performance benefits, be it simply cutting down on costly
    :operations (such as the entropy harvesting code for Yarrow which appears
    :to have found its way into our interrupt path), rethinking locking
    :strategies, optimizing out/coalescing locking, optimizing out excess
    :memory allocation, optimizing synchronization primitives with the same
    :semantics, changing synchronization assumptions to offer weaker/stronger
    :semantics, etc.
    :...
    :
    :Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
    :robert@fledge.watson.org Principal Research Scientist, McAfee Research

        Well, I guess identifying the problem areas is good, but it isn't
        rocket science. It should be glaringly obvious without requiring
        all that much actual testing.

        One big problem you face is that a great deal of the performance
        issues in FreeBSD are not from any single subsystem, but from overheads
        sprinkled throughout the entire codebase. Taken singly, these overheads
        are not significant. A single mutex could be argued to be not all that
        significant... but having to obtain and release 11 mutexes in a code
        path *IS* significant. A single memory allocation might not be
        significant, but having to make three or four in a critical path can be.
        And so on, and so forth.

        Testing performance with little tweaks here and there is not going to
        give you any worthwhile results, IMHO. Just fixing one thing isn't
        going to solve the performance problem. You have to fix the entire
        path. It isn't JUST the mutex overhead that's the problem. It's
        the mutex overhead, the scheduler overhead, all the myrid calls you
        have to make to pin the thread, or enter a critical section (if you
        have to do it too often)... it's the atomic-access requirement to the
        per-cpu %fs:globaldata data which makes it impossible to cache
        per-cpu data. It's the 4BSD/ULE scheduler being used to schedule
        kernel threads, its the thread switching overhead, the microtime calls
        in the switch path, coding requirements to deal with preemption, cpu
        migration, giant lock handling, uma zone allocator's callback API,
        and a dozen other things. Taken singly these items produce a
        fractional degredation. Taken singly these items aren't necessarily
        even 'bad'. Taken together and you have... well, you have the
        situation you find yourself in now.

        To really fix the problem you need to be willing to clean up ALL of
        these subsystems. Well, first you need to recognize that they all need
        to be cleaned up, and I gather that some FreeBSD developers still don't
        recognize that as the problem. Once you recognize that its a problem
        you then need to go and do the work. Cleaning up the subsystems is only
        the first step... it gives you a nice solid reasonably high performing
        base to work with. You still have to deal with the higher-level
        critical pathing issues once you've cleaned up the subsystems. This is
        where things like front-end caches really show their stuff.

                                                -Matt
                                                Matthew Dillon
                                                <dillon@backplane.com>
    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"


  • Next message: Ketrien I. Saihr-Kesenchedra: "Re: Still getting Signal 6 with BETA3"

    Relevant Pages

    • Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
      ... promise to have no measurable runtime overhead. ... any real $cache overhead. ... The PV kernel has over 100K larger text size, nearly 40K alone in mm/ and ... User time also ...
      (Linux-Kernel)
    • Re: strange performance of nested virtual methods
      ... If you dissasemble the code generated and count machine cycles, ... about the overhead from calling a virtual method, ... The ServiceDecorator Class instantiations inherit the val member, ... This could be verified by disabling the cache and rerunning the tests. ...
      (microsoft.public.dotnet.framework.performance)
    • Re: strange performance of nested virtual methods
      ... > If you dissasemble the code generated and count machine cycles, ... > about the overhead from calling a virtual method, ... > This could be verified by disabling the cache and rerunning the tests. ... which does exactly the same as "ServiceDecorator". ...
      (microsoft.public.dotnet.framework.performance)
    • Re: Thanks
      ... for numbers of objects growing with the size of the cache. ... It's pretty easy to implement and has little run time overhead. ... I'm not sure what EJB and Hibernate do about this but since this is such a common scenario I bet they have solved this already. ... There is also db4o which also has nice querying features IIRC: http://www.db4o.com/ ...
      (comp.lang.java.programmer)