Re: Lockless uidinfo.



On Sat, Aug 18, 2007 at 02:00:56PM +0200, Pawel Jakub Dawidek wrote:
Hi.

The patch below remove per-uidinfo locks:

http://people.freebsd.org/~pjd/patches/uidinfo_lockless.patch

With the patch uidinfo is handled using atomics only and no locks
(except for the global hash lock, which is not really important, as it's
not used in the fast paths).

I needed to change ui_sbsize from rlim_t (64bit) to long, because we
don't have 64bit atomics on all archs, and because sbsize represents
size in bytes, it can't go beyond 32bit on 32bit archs (PAE might be a
bit of a problem).

I changed maxval argument in chgproccnt() from int to rlim_t, as this is
what is passed to the function.

In simple ping-pong test on unix domain socket, uidinfo lock was highly
contented:

max total wait_total count avg wait_avg cnt_hold cnt_lock name
1508 3242859 96267052 1467476 2 65 374553 1024743 /usr/src/sys/kern/kern_resource.c:1339 (sleep mutex:sleep mtxpool)

The ping-pong program you can find here:

http://people.freebsd.org/~pjd/misc/unixpingpong.c

At the end we reduced uidinfo structure size by 8 bytes and gain no
measurable performance improvements:)
Yes, I wasn't able to measure anything interesting, unfortunately, but I
still believe the patch should be committed, as I'm sure there are
workloads that will see improvements - note that uidinfo structure is
per-uid, so if there are thousands of processes running with the same
uid, they all need to fight for this one lock.
Not only contention is important, but also the fact that number of
atomic operations in chgsbsize() and chgproccnt() functions was reduced
from 2 to 1.

Ok, while writting this e-mail I came up with a better benchmark:

http://people.freebsd.org/~pjd/misc/unixpingpong2.c

This one doesn't do ping-pong between two processes, but within one
proccess only. This way we eliminate cost of context switches. I was
running 8 such processes (I tested it on a 8way machine) and here are
the results:

x ./uidinfo1.txt
+ ./uidinfo2.txt
+------------------------------------------------------------------------------+
|x +|
|xx ++|
|xx ++|
|A| |A|
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 402742 416417 406216 408269.2 5388.3485
+ 5 1561566 1575987 1568964 1569767 5853.1399
Difference at 95.0% confidence
1.1615e+06 +/- 8204.54
284.493% +/- 2.00959%
(Student's t, pooled s = 5625.55)

As you can see this was just a matter of good benchmark - here we can see
284% performance improvement, yay:)

Kris asked me to verify just in case that the unit cost is not higher
and here are the results from running only one ping-pong process (so no
contention):

x ./uidinfo_up1.txt
+ ./uidinfo_up2.txt
+------------------------------------------------------------------------------+
|x x x x x + + + + +|
| |________A_M_______| |___________A_M__________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 415023 420059 418405 418016.6 1878.5011
+ 5 423731 430984 428084 427735.8 2623.4883
Difference at 95.0% confidence
9719.2 +/- 3327.59
2.32508% +/- 0.796043%
(Student's t, pooled s = 2281.61)

We can see 2.3% of improvement most likely because of atomic operations
reduction.

--
Pawel Jakub Dawidek http://www.wheel.pl
pjd@xxxxxxxxxxx http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Attachment: pgpC7NawUCXhQ.pgp
Description: PGP signature



Relevant Pages

  • Lockless uidinfo.
    ... With the patch uidinfo is handled using atomics only and no locks ... In simple ping-pong test on unix domain socket, uidinfo lock was highly ...
    (freebsd-arch)
  • Re: New cpu_switch() and cpu_throw().
    ... I checked in the new version of cpu_switchfor amd64 today after ... it's switched out and acquire a lock when it's switched in. ... switching into a thread that is set to the blocked_lock another cpu is ... As we discussed, we can just use sfence rather than atomics on amd64, however, x86 will need atomics since you can't rely on the presence of *fence. ...
    (freebsd-arch)
  • Re: BUG in 2.6.20-rt8
    ... lock, do the compare-and-exchange by hand, and then release the lock. ... same as a spinlock, so the real cost is the same. ... might be a little faster with more spinlocks and less atomics). ...
    (Linux-Kernel)
  • Re: sched_lock && thread_lock()
    ... Attilio has been addressing the parts of the kernel which do not need to fall under the scheduler lock and moving them into seperate locks. ... since locks are still needed for accesses to more ... Depending on the architecture PCPU may be more expensive than atomics so we decied not to use it all over. ...
    (freebsd-arch)