Re: [PATCH] Maintaining turnstile aligned to 128 bytes in i386 CPUs



On Wed, 17 Jan 2007, Ivan Voras wrote:

Kip Macy wrote:
Maybe even someone finds a way to get optimized versions of memcpy in
the kernel :)

It makes a huge difference in a proprietary file serving appliance
that I know of.

Beneficial difference?

Heheh.

However, past measurements on FreeBSD have supposedly
indicated that it isn't that big a win as a result of increased context
switch time.

No, they indicated that the win is not very large (sometimes negative),
and is very machine dependent. E.g., it is a small pessimization on all
64-bit i386's running in 64-bit mode -- that's just all i386's you would
want to buy now. On other CPU classes:

P2 (my old Celeron): +- epsilon difference
P3 (freefall): +- epsilon difference
P4 (nosedive's Xeon): movdqa 17% faster than movsl, but all other cached
    moves slower using MMX or SSE[1-2]; movnt with block prefetch 60%
    faster than movsl with no prefetch, but < 5% faster with no prefetch
    for both.
AXP (my 5-year-old system with a newer CPU): movq through MMX is 60%
    faster than movsl for cached moves, but movdqa through XMM is only 4%
    faster. movnt with block prefetch is 155% faster than movsl with no
    prefetch, and 73% faster with no prefetch for both.
A64 in 32-bit mode: in between P4 and AXP (closer to AXP). movsl doesn't
    lose by so much, and prefetchnta actually works so block prefetch is
    not needed and there is a better chance of prefetching helping more
    than just benchmarks.
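
The XMM variants compared above can be sketched in userland C with SSE2
intrinsics. This is only an illustration under assumptions, not the code
that was benchmarked: the names copy_movdqa, copy_movnt and PREFETCH_AHEAD
are hypothetical, both buffers are assumed to be 16-byte aligned with a
length that is a multiple of 16, and the second routine uses a plain
prefetchnta hint rather than block prefetch. A kernel version would also
have to save and restore the FPU/SSE state, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2 intrinsics; also pulls in the SSE ones */
#endif

/* Hypothetical prefetch distance in bytes; a real value needs tuning. */
#define PREFETCH_AHEAD	256

/*
 * Copy len bytes with aligned 16-byte XMM loads and stores (movdqa).
 * Assumes dst and src are 16-byte aligned and len is a multiple of 16.
 * Falls back to memcpy() when SSE2 is not available at compile time.
 */
static void
copy_movdqa(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
	assert((len & 15) == 0);
#if defined(__SSE2__)
	__m128i *d = dst;
	const __m128i *s = src;

	for (size_t i = 0; i < len / 16; i++)
		_mm_store_si128(&d[i], _mm_load_si128(&s[i]));
#else
	memcpy(dst, src, len);
#endif
}

/*
 * Same contract, but with non-temporal stores (movntdq) that bypass the
 * cache, and a prefetchnta hint issued once per 64 bytes of source.
 */
static void
copy_movnt(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
	assert((len & 15) == 0);
#if defined(__SSE2__)
	__m128i *d = dst;
	const __m128i *s = src;

	for (size_t i = 0; i < len / 16; i++) {
		size_t off = i * 16;

		/* Only hint at addresses that are still inside the buffer. */
		if (off % 64 == 0 && off + PREFETCH_AHEAD < len)
			_mm_prefetch((const char *)src + off + PREFETCH_AHEAD,
			    _MM_HINT_NTA);
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
	}
	_mm_sfence();	/* order the non-temporal stores before returning */
#else
	memcpy(dst, src, len);
#endif
}
```

Timing each routine against memcpy() over buffers both smaller and much
larger than the cache would be needed to see anything like the cached and
uncached differences quoted above, and the results will vary by machine.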

Bruce
_______________________________________________
freebsd-arch@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "freebsd-arch-unsubscribe@xxxxxxxxxxx"


