RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout (gnagelhout_at_sandvine.com)
Date: 05/05/04

    To: freebsd-current@freebsd.org
    Date: Wed, 5 May 2004 12:32:42 -0400 
    
    

    Bruce Evans wrote:

    >> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles (per
    >> LOCK&UNLOCK, and dividing by 100) under UP, and ~300 cycles for SMP.
    >> Assuming 10 locks for every packet (which is conservative), at
    >> 500Kpps, this accounts for:
    >> 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8 billion cycles)
    >>
    > 300 cycles seems far too much. I get the following times for slightly
    > simpler locking in userland:
    >
    > %%%
    > #define _KERNEL
    > #include ...
    >
    > int slock;
    > ...
    > for (i = 0; i < 1000000; i++) {
    > while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
    > ;
    > atomic_store_rel_int(&slock, 0);
    > }
    > %%%
    >
    >                            !SMP case    SMP case
    > Athlon XP2600 UP system:   22 cycles    37 cycles
    > Celeron 366 SMP system:    35 cycles    48 cycles
    >
    > The extra cycles for the SMP case are just the extra cost of one lock
    > instruction. Note that SMP should cost twice as much extra, but the
    > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
    > which always locks the bus. After fixing this:
    >
    >                            !SMP case    SMP case
    > Athlon XP2600 UP system:    6 cycles    37 cycles
    > Celeron 366 SMP system:    10 cycles    48 cycles
    >
    > Mutexes take longer than simple locks, but not much longer unless the
    > lock is contested. In particular, they don't lock the bus any more
    > and the extra cycles for locking dominate (even in the !SMP case due
    > to the pessimization).
    >
    > So there seems to be something wrong with your benchmark. Locking the
    > bus for the SMP case always costs about 20+ cycles, but this hasn't
    > changed since RELENG_4 and mutexes can't be made much faster in the
    > uncontested case since their overhead is dominated by the bus lock
    > time.
    >
    > -current is slower than RELENG_4, especially for networking, because
    > it does lots more locking and may contest locks more, and when it hits
    > a lock and for some other operations it does slow context switches.
    > Your profile didn't seem to show much of the latter 2, so the problem
    > for bridging may be that there is just too much fine-grained locking.
    >
    > The profile didn't seem quite right. It was missing all the call
    > counts and times. The times are not useful for short runs unless high
    > resolution profiling is used, but the call counts are. Profiling has
    > been broken in -current since last November so some garbage needs to
    > be ignored to interpret profiles.
    >
    > Bruce
    >

    I wonder if the lock instruction is simply much more expensive on the
    Xeon architecture. I ran a program very similar to yours with and without
    the "lock" instruction:

    static inline int _osiCondSet32Locked(volatile unsigned *ptr, unsigned old,
                                         unsigned replace)
    {
        int ok;
        __asm __volatile("mov %2, %%eax;"
                         "movl $1, %0;" /* ok=1 */
                         "lock;"
                         "cmpxchgl %3, %1;" /* if(%eax==*ptr) *ptr=replace */
                         "jz 0f;" /* jump if exchanged */
                         "movl $0, %0;" /* ok=0 */
                         "0:"
                         : "=&mr"(ok), "+m"(*ptr)
                         : "mr"(old), "r"(replace)
                         : "eax", "memory" );
        return ok;
    }

    unsigned int value;
    ...
    for (i = 0; i < iterations; i++)
    {
        _osiCondSet32Locked(&value, 0, 1);
    }

    and got the following results:

    PIII (550MHz) w/o lock:   8 cycles
    PIII (550MHz) w/ lock:   26 cycles
    Xeon (2.8GHz) w/o lock:  12 cycles
    Xeon (2.8GHz) w/ lock:  132 cycles
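
As a rough check on the arithmetic, here is a minimal C sketch that derives the per-lock overhead from the figures above and reproduces the cycle budget estimate quoted earlier. The function names are hypothetical and exist only for illustration; the inputs are the numbers given in this message.

```c
#include <stdint.h>

/* Hypothetical helper: extra cycles attributable to the "lock" prefix,
 * taken as the difference between the locked and unlocked timings. */
static unsigned lock_prefix_cost(unsigned with_lock, unsigned without_lock)
{
    return with_lock - without_lock;
}

/* Hypothetical helper: cycles per second spent on LOCK/UNLOCK pairs,
 * given the cost of one pair, the number of pairs per packet and the
 * packet rate. */
static uint64_t lock_cycles_per_second(uint64_t cycles_per_pair,
                                       uint64_t locks_per_packet,
                                       uint64_t packets_per_second)
{
    return cycles_per_pair * locks_per_packet * packets_per_second;
}

/* Hypothetical helper: fraction of the CPU's cycles used by locking. */
static double lock_share(uint64_t lock_cycles, uint64_t cpu_hz)
{
    return (double)lock_cycles / (double)cpu_hz;
}
```

With the numbers from this thread, lock_prefix_cost(132, 12) gives 120 on the Xeon and lock_prefix_cost(26, 8) gives 18 on the PIII, while lock_cycles_per_second(300, 10, 500000) gives 1.5 billion, a little over half of a 2.8GHz CPU.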

    This means that on the Xeon, each lock instruction takes about 120
    cycles! Two of them come to roughly 240 cycles, which is close to the
    300 I mentioned before (assuming that both EM_LOCK and EM_UNLOCK use the
    lock instruction). I have tried reading through the Intel optimization
    guide for any hints on making this better, but I haven't been able to
    find anything useful (so far).
    This would certainly explain why running 5.2.1 under SMP is performing
    so poorly for me.
    If anyone is interested in running this test, I can forward the source
    code for this program.
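
For anyone wanting to repeat the comparison without the original source, a minimal portable sketch follows. It is not the program used above: it assumes a C11 compiler, uses atomic_compare_exchange_strong from <stdatomic.h> in place of the hand-written asm (on x86 this is typically compiled to a lock cmpxchg), and times with timespec_get in nanoseconds rather than counting cycles. All function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: wall-clock time in nanoseconds (C11 timespec_get). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Compare-and-set with no atomicity: the "w/o lock" case. */
static int cond_set_plain(volatile unsigned *ptr, unsigned old,
                          unsigned replace)
{
    if (*ptr != old)
        return 0;
    *ptr = replace;
    return 1;
}

/* Atomic compare-and-set: the "w/ lock" case. */
static int cond_set_locked(atomic_uint *ptr, unsigned old, unsigned replace)
{
    return atomic_compare_exchange_strong(ptr, &old, replace);
}

/* Average nanoseconds per call of the plain version. */
static double time_plain(unsigned long iterations, volatile unsigned *value)
{
    uint64_t start = now_ns();
    for (unsigned long i = 0; i < iterations; i++)
        cond_set_plain(value, 0, 1);
    return (double)(now_ns() - start) / (double)iterations;
}

/* Average nanoseconds per call of the atomic version. */
static double time_locked(unsigned long iterations, atomic_uint *value)
{
    uint64_t start = now_ns();
    for (unsigned long i = 0; i < iterations; i++)
        cond_set_locked(value, 0, 1);
    return (double)(now_ns() - start) / (double)iterations;
}
```

Calling time_plain and time_locked with the same iteration count, and multiplying the difference by the clock rate in GHz, gives a rough per-lock cost in cycles. As in the loop above, only the first iteration actually performs the exchange; later iterations find the value already set and fail the comparison.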

    The profiling I did was missing the call counts because I didn't compile
    mcount into the key modules (bridge, if_em, etc.): it slowed things down
    too much, so I relied on just the stats from the interrupt. I think
    I did run it long enough to get reasonable results out of it though. I
    have used this kind of profiling extensively on 4.7 in order to optimize
    this application.

    Thanks,

    Gerrit

    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

