Re: Threading to speed a CPU-bound application?



On Jun 16, 8:39 am, Rosarin Roy <rosarinj...@xxxxxxxxx> wrote:
On Jun 15, 8:57 am, mhage...@xxxxxxxxx wrote:



On Jun 15, 12:49 am, Tim Bradshaw <tfb+goo...@xxxxxxxx> wrote:

On Jun 14, 7:30 pm, mhage...@xxxxxxxxx wrote:

Since adding the logic seemed to slow everything down, it appears that
the program is being limited by the CPU and not the disk I/O? Since I
have 8 CPUs available, I'd like to try and take advantage of that to
speed things up.

I think you're not being stupid, but before doing anything I would
spend some time getting a comprehensive, *measured* picture of what
the thing is doing and where the time is going, so that you actually
*know* where the bottlenecks are and can measure the changes as you
modify the program. Things like memory and disk access patterns can
make a very large performance difference. You'll also need to
understand what the characteristics of the machine are, which aren't
entirely simple on a 25k (bandwidth and (I think) latency depend on
where CPUs are with respect to each other). Finally, do you have 8
sockets or 8 cores? If the machine has US IV or IV+ CPUs then you
probably have either 8 cores (4 sockets, 1 uniboard, the smallest
domain) or perhaps 16 cores on 2 uniboards. I suspect the performance
differences matter there too, since cores on the same socket share
caches (which caches depends on whether it's US IV or IV+).

Since you are on a 25k you probably have decent Sun support, so one
thing would be to consider seeing if you can get help from Sun on
performance stuff. They have people who understand these machines
quite well!

--tim

Tim,

I've been trying to measure my code, but so far the only way I have
found available to me is gprof. I've requested dtrace, but the admins
won't set it up, plus I found an FAQ from Sun that says dtrace does
not work in local zones yet (I'm in a zone). Do you have any
suggestions on how I can determine where the real bottlenecks are?

I have no idea what the "real" configuration of the machine is; "they"
are using mushroom management. I only know what I can get the system
to tell me:

$ psrinfo -vp
The physical processor has 2 virtual processors (0, 4)
UltraSPARC-IV (portid 0 impl 0x18 ver 0x31 clock 1350 MHz)
The physical processor has 2 virtual processors (1, 5)
UltraSPARC-IV (portid 1 impl 0x18 ver 0x31 clock 1350 MHz)
The physical processor has 2 virtual processors (2, 6)
UltraSPARC-IV (portid 2 impl 0x18 ver 0x31 clock 1350 MHz)
The physical processor has 2 virtual processors (3, 7)
UltraSPARC-IV (portid 3 impl 0x18 ver 0x31 clock 1350 MHz)

'top' reports 16G of RAM:

load averages: 0.14, 0.13, 0.17    09:46:40
158 processes: 157 sleeping, 1 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 16G real, 9492M free, 3181M swap in use, 25G swap free

So, it appears there are 4 physical CPUs with 2 cores each. I'm not
sure how to use that information though. Like you said, I'd really
like to measure what's going on in my code when it runs, but I'm not
sure how to go about doing that.
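
[One low-tech option when gprof is the only profiler available is to
time the major phases by hand. The sketch below is only an
illustration, not anything posted in this thread: the phase names
(read, convert, write) and every helper name are hypothetical, and it
assumes a single-threaded program. It uses the ISO C clock() call,
which reports processor time, so time spent blocked on I/O will mostly
not appear in these totals.]

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical phases of a record-conversion program. */
enum phase { PHASE_READ, PHASE_CONVERT, PHASE_WRITE, PHASE_COUNT };

static const char *const phase_name[PHASE_COUNT] = {
    "read", "convert", "write"
};

/* Accumulated processor time per phase, in clock() ticks. */
static clock_t phase_ticks[PHASE_COUNT];

/* Call at the start of a timed region and keep the returned value. */
static clock_t phase_begin(void)
{
    return clock();
}

/* Call at the end of a timed region; adds the elapsed ticks to phase p. */
static void phase_end(enum phase p, clock_t started)
{
    phase_ticks[p] += clock() - started;
}

/* Total processor time charged to phase p so far, in seconds. */
static double phase_seconds(enum phase p)
{
    return (double)phase_ticks[p] / CLOCKS_PER_SEC;
}

/* Print one line per phase, typically once just before exit. */
static void phase_report(FILE *out)
{
    for (int p = 0; p < PHASE_COUNT; p++)
        fprintf(out, "%-8s %8.3f s CPU\n",
                phase_name[p], phase_seconds((enum phase)p));
}
```

[clock() is coarse, so it is better to wrap a batch of records (say, a
few thousand) in one phase_begin()/phase_end() pair than to time each
record individually. Comparing the reported CPU totals with the
wall-clock run time gives a rough idea of how much of the run is spent
computing and how much is spent waiting.]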

Thanks,
Matthew

A couple of things to consider:
- Are you performing any random out-of-order reads/writes?
- Using your gprof output, did you tweak all your functions so that
they don't consume a lot of time?

I ran into a similar situation once, when I found that functions like
atoi() were being called on the same field more than once, which
resulted in poor performance. When I saved the result of these function
calls in a variable and reused it, I saw a 10-20% gain in performance.
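
[The pattern described above might look like the following minimal
sketch. The struct and function names are hypothetical and chosen only
for illustration; the 10-20% figure is the poster's own measurement and
is not something this fragment demonstrates.]

```c
#include <stdlib.h>

/* Hypothetical output record: the same input field feeds three targets. */
struct targets {
    int billing_id;
    int audit_id;
    int report_id;
};

/* Repeated conversion: the same text is parsed once per target. */
static void fill_targets_repeated(const char *id_field, struct targets *t)
{
    t->billing_id = atoi(id_field);
    t->audit_id   = atoi(id_field);
    t->report_id  = atoi(id_field);
}

/* Cached conversion: parse once, then reuse the saved value. */
static void fill_targets_cached(const char *id_field, struct targets *t)
{
    const int id = atoi(id_field);

    t->billing_id = id;
    t->audit_id   = id;
    t->report_id  = id;
}
```

[Both functions produce the same result; the second simply does the
text-to-integer work once per record instead of three times.]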

Rosarin Roy

Roy,

The only out-of-order reads I *might* be doing would be on the input,
and that depends on when the VM system releases mmap'd pages. I made
another post about that specifically. But in general, no, I'm doing a
totally sequential read of the input data.

I'm also doing exactly what you mentioned, i.e. only performing a
single conversion on fields that exist in multiple targets, and
storing the value in a variable instead of re-converting the source
data more than once. I have also written my own versions of functions
like atoi because (1) I needed something to deal with 64-bit numbers,
(2) I also needed an itoa function, and (3) all the source and target
data are fixed length, so using the typical atoi-type functions would
require that I null-terminate the data (a waste of time). I also
benchmarked my functions against the system functions and made sure my
code was always faster (they are written very carefully and designed
specifically for one task).
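
[For readers wondering what such routines can look like, here is a
minimal sketch. It is not the poster's code, which was never shown. It
assumes unsigned decimal fields that are right-aligned and zero-padded,
and the function names are hypothetical. Neither routine reads or
writes a null terminator.]

```c
#include <stddef.h>
#include <stdint.h>

/* Parse exactly len ASCII digits from a fixed-width field that is not
 * null-terminated. Returns 0 and stores the value in *out on success;
 * returns -1 on a non-digit byte or if the value would overflow. */
static int parse_fixed_u64(const char *field, size_t len, uint64_t *out)
{
    uint64_t value = 0;

    for (size_t i = 0; i < len; i++) {
        /* Anything outside '0'..'9' wraps to a large unsigned value. */
        unsigned digit = (unsigned)(field[i] - '0');

        if (digit > 9)
            return -1;
        if (value > (UINT64_MAX - digit) / 10)
            return -1;              /* value * 10 + digit would overflow */
        value = value * 10 + digit;
    }
    *out = value;
    return 0;
}

/* Write value right-aligned and zero-padded into exactly len bytes,
 * with no terminator. Returns 0 on success, or -1 if the value needs
 * more than len digits (the field then holds only the low digits). */
static int format_fixed_u64(uint64_t value, char *field, size_t len)
{
    for (size_t i = len; i > 0; i--) {
        field[i - 1] = (char)('0' + value % 10);
        value /= 10;
    }
    return value == 0 ? 0 : -1;
}
```

[Because the length is passed in, the parser can run directly over the
mapped input buffer and the formatter can write straight into the
output record, with no copying into a temporary string.]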

Matthew
