Re: Itanium Solutions Alliance

From: Rob Young (young_r_at_encompasserve.org)
Date: 09/03/05


Date: 3 Sep 2005 00:48:42 -0500

In article <-eGdnWNGVZLUiYTeRVn-jw@metrocastcablevision.com>, Bill Todd <billtodd@metrocast.net> writes:
> Rob Young wrote:

>> Sure, with twice the CPUs.
>
> No, moron: on a strictly per-core basis.

        No. Twice. See your SAP SD 2-tier below. You must have missed
        that 128 is 2 * 64.

> How about SAP SD 2-tier? SPARC64 holds top
> honors there as well, with 128 cores scoring 21K - again edging out
> POWER5's 20K using 64 cores

        Right. For twice the CPU count, they are nearly the same.
        There is 1 stale tpc entry by Fujitsu out there on tpc.org.
        Anyone running Sun kit knows to steer clear of tpc as it
        most assuredly demonstrates SPARC weakness. Again, since
        Oracle licenses are very expensive, it makes little sense
        to go the SPARC route.
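
As a rough check on the per-core argument, the rounded SAP SD 2-tier figures quoted above (21K on 128 cores, 20K on 64 cores) can simply be divided through. This is only a sketch: the scores are the approximate numbers from this thread, not the official published results, and the helper name `per_core` is made up for illustration.

```python
# Per-core comparison of the two SAP SD 2-tier results quoted in the thread.
# Scores are the rounded "21K" and "20K" figures from the post (assumption:
# treated here as plain benchmark scores, not exact published values).

def per_core(score, cores):
    """Return the benchmark score divided by the number of cores."""
    return score / cores

sparc64 = per_core(21_000, 128)   # SPARC64 system, 128 cores
power5 = per_core(20_000, 64)     # POWER5 system, 64 cores
ratio = power5 / sparc64

print(f"SPARC64: {sparc64:.1f} per core")
print(f"POWER5:  {power5:.1f} per core")
print(f"POWER5 / SPARC64 per-core ratio: {ratio:.2f}")
```

On these rounded figures the POWER5 result comes out at roughly 1.9 times the SPARC64 result per core, which is the "twice the CPUs for nearly the same score" point in numeric form.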

http://www.itjungle.com/breaking/bn083005-story01.html

By vendor, IBM was the dominant supplier of servers in the second quarter, with
$3.892 billion in sales, up 4.1 percent and accounting for 31.9 percent of the
market. HP was number two, with $3.48 billion, up 11.5 percent and giving it
28.5 percent of the market. HP is clearly--and finally--getting some traction
after many years of merging itself with Compaq. Sun was third with $1.372
billion in sales, down 5.3 percent and giving it 11.3 percent of the market.

>> But Fujitsu's projections are multi-billion in IPF sales in
>> the next 3 years.
>
> Hey, in Y2K IDC was projecting that Itanic would sell $28 billion in
> servers - in 2004. Talk is cheap, and often utterly worthless
> (especially, it would seem, when it describes Itanic's glowing future).
>

        It appears they fell short.

>>
>> Right, and the point of the slide I referenced. No infrastructure
>> changes to support 3 generations of Itanium.
>
> So they don't have to lift a finger: this is supposed to demonstrate
> their unswerving and monumental commitment, rather than an architecture
> which can now be left on auto-pilot until they've had a chance to see
> whether it's going to auger in or manage to clear the treetops for
> another year or two?
>

        No. It's the whole fork-lift upgrade thing. That's why
        there are rumblings about Sun's future, as a forklift upgrade
        will be coming (Rock, Niagara) at a time when they are declining.
        Not good.

> I'll note that he appears to
>>>expect miracles (funny how that kind of thinking keeps cropping up among
>>>Itanic supporters) from a Montecito 1.6 GHz part that clocks at the same
>>>speed the current Madison II does, and has almost exactly the same
>>>amount of cache per core that the current Madison II does (somewhat more
>>>L2, but the L1 and massive L3 are the same),
>>
>>
>> You are downplaying cache improvements.
>
> No, Rob: as usual, you're hyping them far beyond what they're likely to
> be worth. It's such a well-established pattern with you that you're
> probably not aware of it yourself any more.
>
> As the designers point
>> out, cache was a major improvement:
>>
>> http://www.ewh.ieee.org/r5/denver/sscs/Presentations/2005.03.Montecito1.pdf
>>
>> Check out slide 4 and you see:
>>
>> New level of cache
>
> Not a new level at all, just a splitting and expansion of the existing
> L2 in McKinley/Madison. And no faster access to it, either: still 6
> cycles (actually, IIRC it may have been only 5 cycles before).
>

        That's their description. What does the designer know about
        his design anyway?

>> . 6 cycle 1MB L2I
>> . Addresses largest CPI component for transaction processing
>> (Instruction misses)
>> . Frees 256KB L2D to be dedicated for data
>> . ECC add in L2T tags, and parity added to L1I TLB
>> . Parity in FP and Integer
>>
>> Cache improvements help transaction processing.
>
> No *** - but hardly dramatically in this case.

http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/sizingsuperheavys.pdf

The biggest change was to split the unified 256KB L2 cache of the McKinley and
Madison into separate 1MB L2 instruction cache and 256KB L2 data cache. This
was done basically to eliminate instruction stream competition for the
bandwidth and capacity of the L2 data cache. The 16KB instruction caches of
the Madison and Montecito hold only 1024 instruction bundles which represents
about 2.4k useful instructions taking into account the ~20% structural NOP
content of a typical IA64 executable. To put that into perspective, that is
only about 1/7th the instruction capacity of the POWER5's 64KB instruction
cache. Obviously for many classes of programs, instruction stream fetching will
represent a significant portion of the processor requests on the unified 256KB
L2 as well as a large portion of its contents.
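
The capacity arithmetic in the paragraph above can be reproduced in a few lines. This is a sketch under stated assumptions: an IA-64 bundle is 128 bits (16 bytes) with three instruction slots, the ~20% NOP fraction is the excerpt's own estimate, and the 4-byte fixed instruction size used for the POWER5 comparison is my assumption, since the excerpt does not state it. The function and constant names are illustrative only.

```python
# Reproduce the I-cache capacity estimate from the excerpt.
BUNDLE_BYTES = 16        # IA-64 bundle: 128 bits
SLOTS_PER_BUNDLE = 3     # three instruction slots per bundle
NOP_FRACTION = 0.20      # ~20% structural NOPs, per the excerpt
POWER_INSN_BYTES = 4     # assumption: fixed 32-bit instructions on POWER5

def ia64_useful_instructions(cache_bytes, nop_fraction=NOP_FRACTION):
    """Return (bundles, useful instructions) that fit in an IA-64 I-cache."""
    bundles = cache_bytes // BUNDLE_BYTES
    slots = bundles * SLOTS_PER_BUNDLE
    return bundles, slots * (1 - nop_fraction)

bundles, useful = ia64_useful_instructions(16 * 1024)    # 16KB L1 I-cache
power5_insns = (64 * 1024) // POWER_INSN_BYTES           # 64KB I-cache

print(f"bundles: {bundles}")
print(f"useful instructions: {useful:.0f}")
print(f"fraction of POWER5 capacity: {useful / power5_insns:.2f}")
```

Under these assumptions the 16KB cache holds 1024 bundles and roughly 2.4k to 2.5k useful instructions, about 0.15 of the POWER5 figure, which is consistent with the "about 1/7th" in the excerpt.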

By splitting the L2 caches in Montecito a lot of good things happen. From the
data stream perspective, the 256KB L2 suddenly has one less port, and its
entire 256KB capacity is available for data. This means less contention and
stalls and fewer capacity and conflict misses. This adds up to more predictable
memory hierarchy behavior, a very important feature for an architecture that
relies heavily on static instruction scheduling. From the instruction stream
perspective, the L2 I-cache can be located physically close to the L1 I-cache
and its design optimized for the task. It doesn't need to be multi-ported or
support sub-word access. As a result the 1MB L2 I-cache in Montecito likely has
little or no latency penalty over the 256KB L2 D-cache, despite having four
times its capacity. The combination of a very low-latency (1 cycle) L1 I-cache
and a large, fast L2 I-cache has operational characteristics that are impossible
to duplicate in a single-level cache. For example, if 90% of instruction stream
accesses that hit in the L1 or L2 hit in the L1, and the L2 has a latency of 6
cycles, then the L1/L2 combination performs like a single-level 1MB
instruction cache with average latency of 0.9*1 + 0.1*6 = 1.5 cycles. This is
half the latency of the 64KB instruction cache in the Alpha EV6/7 and AMD K7/8.
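
The weighted-average latency in the excerpt is easy to play with for other hit rates. The sketch below follows the excerpt's own framing (an L2 hit costs the full L2 latency, counted only over accesses that hit in L1 or L2); the function name and the extra hit fractions are illustrative assumptions, not figures from the whitepaper.

```python
def effective_latency(l1_hit_fraction, l1_cycles, l2_cycles):
    """Average latency over accesses that hit in either L1 or L2.

    l1_hit_fraction is the share of those accesses served by the L1;
    the remainder are served by the L2 at its full latency.
    """
    if not 0.0 <= l1_hit_fraction <= 1.0:
        raise ValueError("l1_hit_fraction must be between 0 and 1")
    return l1_hit_fraction * l1_cycles + (1.0 - l1_hit_fraction) * l2_cycles

# The excerpt's example: 90% L1 hits, 1-cycle L1, 6-cycle L2.
avg = effective_latency(0.9, 1, 6)
print(f"90% L1 hits: {avg:.2f} cycles")

# Hypothetical hit fractions, to show how sensitive the average is.
for frac in (0.80, 0.95):
    print(f"{frac:.0%} L1 hits: {effective_latency(frac, 1, 6):.2f} cycles")
```

With the excerpt's inputs this gives 1.5 cycles; the excerpt's "half the latency" remark then implies about 3 cycles for the single-level 64KB caches it compares against.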

---
	Taking Madison's 256 KB Unified L2 and splitting it into 1 MB I
	and 256 KB D cache has what percentage increase in performance?
	Not dramatic?  How would you characterize the increase?

>>>The only *real* performance enhancement is 
>>>Montecito's somewhat primitive dual-thread-per-core capability,
>> 
>> 
>> 	Not so... see cache improvements above.
> 
> I saw them, Rob - I even mentioned them before you did.  They're just 
> nowhere nearly as impressive as you'd like people to believe.
> 

	How so?  Can you refer us to a whitepaper that discusses this
	in depth?  Thanks.

>> 
>> 	Ah... as you and MAS were exchanging pleasantries over this
>> 	slide:
>> 
>> http://img359.imageshack.us/img359/712/untitled13jh.gif
>> 
>> 	HP/Intel talks about 200 tpmC 2S performance, as you point out
>> 	Power5+ in your estimation will do 270.  But I'll guess Intel
>> 	is sandbagging a bit
> 
> Of course you will, Rob - just as you thought they were with McKinley's 
> clock rate, which you believed would hit 1.4 GHz or even more.
> 

	Ah... sweet.  I know you are straining when you dodge and
	dig dig dig instead.  B^)

>   - they have been known to do this and it
>> 	makes sense - upside surprise, etc.  But suppose they are sandbagging
>> 	5% and actually do 210
> 
> Why should anyone suppose that, Rob?  I mean, they came right out and 
> said 200, and for the reasons I listed that number is credible (at least 
> if they're using HP's zx1 chipset to obtain it).
> 

	Sure.  That would explain why it is so low, it isn't the zx2.

>   - maybe not.  But
>> 	I'm going to say 250K... Either way, a 1.8 GHz Montecito 
>> 	(Foxton to 2.0 GHz) will surely do better than this recent 2S DB2 
>> 	submission of 200K:
>> 
>> http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=105080802
> 
> Assuming, of course, that Intel manages to get Montecito to ship at 1.8 
> GHz - which the numbers they've released so far suggest may be a bit 
> difficult.  But you also seem reluctant to confront the fact that a year 
> ago POWER5 was significantly out-performing the score you refer to above 
> (obtained on Linux) when running on AIX:  430K tpmC for 8 cores, which 
> means *at least* 215K tpmC on 4 cores when running on AIX.
> 
>> 
>> 	So you are so opposed to Montecito pulling even with Power5,
> 
> Wrong yet again, Rob:  your original poppy*** to which I responded was 
> that "Montecito will pull past Power5 in performance", and my response 
> was "Perhaps a 1.8 GHz Montecito system (if they actually manage to ship 
> it at that clock rate) can equal that, but it will hardly 'pull past' it 
> by any significant amount."
> 

	Okay.  But that is a totally different tune.  This took all
	of 30 seconds to find:

http://groups.google.com/group/comp.arch/msg/470b858ad992ae30?dmode=source&hl=en

From: "Bill Todd" <billt...@metrocast.net>
Newsgroups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel
Subject: Re: Itanium finally passes Alpha at HP
Date: Fri, 27 Aug 2004 13:15:04 -0400

>  I'm stunned by how good POWER5
> is.  But I know that next year Montecito will go from 1 thread per
> package to 4 threads per package.  Itanium will be down to a 16P system
> to compete with IBM's 16P system.

If you consider that having less than half the TPC-C performance of the
POWER5 system with an equal number of cores qualifies as 'competing with'
it, perhaps.

---
	Montecito is a lot stronger than you care to admit.  But that's
	okay.  You've been downplaying Itanica and it looks like it
	is finally catching up to you.

> Try a little harder to understand what you read before responding, and 
> your responses might become at least *somewhat* less incompetent.

	Sure Bill.  Have a good evening.

				Rob

Men with walkie-talkie                  I'm home again to you babe
Men with flashlights waving             You know it makes me wonder
Up upon the tower                       Sittin' in the quiet slipstream
The clock reads daylight savin'         Rollin' in the thunder
                                -- Neil Young