Re: On-chip, 7-way associative



In article <ttKdnbWN5ppb13bZnZ2dnUVZ_oadnZ2d@xxxxxxxxxxxxxxxxxxxxxxxx>,
Bill Todd <billtodd@xxxxxxxxxxxxx> wrote:

Robert Deininger wrote:
In article <EfadnZeFg-2t1nfZnZ2dnUVZ_s6dnZ2d@xxxxxxxxxxxxxxxxxxxxxxxx>,
Bill Todd <billtodd@xxxxxxxxxxxxx> wrote:

...

As usual, 'it depends'. In fact, EV7 has slightly (but *only* slightly)
worse performance than EV68 with 16 MB of fast board-level L2 at the
same clock rate in SPECint:

Actually, the two may have been just about equal in SPECint after
normalizing clock rates (EV68 had a significantly better score but
clocked 100 MHz faster).

having only about 1/9th as much cache is
significantly offset by the approximate halving in cache latency (to a
latency of about 10 ns. for EV7, while the off-chip EV68 L2 was IIRC
close to 20 ns. and also direct-mapped - single-way-associative - which
further decreased its relative effectiveness) and the approximate
halving of main-memory latency (from about 160 ns. with EV68 to about 80
ns. with EV7 IIRC).
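
The latency argument above can be made concrete with a back-of-the-envelope average-access-time model. This is only a sketch: the 10/20 ns L2 and 80/160 ns memory figures are the approximate ones quoted in this thread, while the miss rates are made-up placeholders, and the model ignores L1, bandwidth, and overlapped misses.

```python
def amat_ns(hit_latency_ns, miss_rate, memory_latency_ns):
    """Average access time below L1: every access pays the L2 latency,
    and the fraction that misses L2 also pays main-memory latency."""
    return hit_latency_ns + miss_rate * memory_latency_ns


def break_even_miss_rate(small_hit, small_mem, small_miss_rate, big_hit, big_mem):
    """L2 miss rate the big, slow cache needs in order to match the
    small, fast cache's average access time (negative means it cannot)."""
    target = amat_ns(small_hit, small_miss_rate, small_mem)
    return (target - big_hit) / big_mem


# Approximate latencies from the thread (ns); treat them as rough.
EV7 = {"hit": 10.0, "mem": 80.0}
EV68 = {"hit": 20.0, "mem": 160.0}

# Hypothetical EV7 L2 miss rates, chosen only for illustration.
for ev7_miss in (0.05, 0.10, 0.25):
    needed = break_even_miss_rate(
        EV7["hit"], EV7["mem"], ev7_miss, EV68["hit"], EV68["mem"]
    )
    print(f"EV7 miss rate {ev7_miss:.2f}: "
          f"AMAT {amat_ns(EV7['hit'], ev7_miss, EV7['mem']):.1f} ns, "
          f"EV68 needs miss rate <= {needed:.4f}")
```

Under these assumptions, whenever the EV7 L2 miss rate is below 12.5% the EV68 configuration cannot catch up even with a cache that never misses, since its 20 ns hit latency alone exceeds the EV7 average.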

ES45 CPUs come in 2 flavors:
1 GHz CPUs with 8 MB of L2 cache/CPU (32 MB maximum system cache)
1.25 GHz CPUs with 16 MB of L2 cache/CPU (64 MB maximum system cache)

The quad-socket 1.25 GHz ES45 TPC-C submission appeared to contain only
16 MB total...

The ES45 L2 cache is on the CPU board, but not on the CPU chip. Any of
the CPUs can access the whole system's cache at full speed.

... but the ES45 SPECint submission indeed states that 16 MB per CPU is
supported (one might infer from the paperwork that only 16 MB was used,
but the HP citation that Hoff provided above states that it had 32 MB -
neither fish nor fowl).

Strange.

Agreed.

Benchmarks are bad enough, but benchmarks with poorly-specified
configuration details are even worse.

Something must be in error with some of these reports.


I don't think the ES45 memory latency is as bad as you remember, but I
don't have the numbers at my fingertips.

The 160 ns. figure I recall from somewhere or other; the Los Alamos
citation that Hoff provided pegs it at 170 ns. (close enough).

It's fairly dependent on how you load the RAM. I'd have expected the Los
Alamos system to get this right.

I thought it was in the 120-130 range, but that's purely from memory.

ES47 memory latency is slightly
better than ES45; I don't remember it being twice as fast.

Hoff's HP citation lists 75 ns. as the local-RAM latency for the GS1280;
his Los Alamos citation lists 83 ns. (for a GHz GS1280 in late 2002
running at 1.2 GHz - how interesting).

Well, I'll chalk this one up to bad memory - mine.

We've certainly seen workloads where ES45 beats ES47, and vice versa. It
turns out a lot of real-world work fits in the ES45 cache, but not the
ES47 cache.

The Wiki post that Dan Foster cited shows a marked drop-off in
increasing cache effectiveness for SPECint at sizes exceeding 1 MB, and
the same is true for analyses of TPC-C-style workloads that I've seen.
So while I don't doubt that *some* real-world applications may be far
better-suited to 16 MB (or better yet 64 MB) of slower, non-associative
off-chip cache, I doubt that this is anything like *typical*.

One obvious case that can enjoy lots of cache is a random mishmash of
unrelated workloads.


Yes, EV7 is optimized for large systems, and turned out beautifully. It
isn't cost-effective for small systems.

I question that, especially given AMD's experience in small systems with
a not-all-that-dissimilar system architecture (on-chip memory controller
and router, 1 MB on-chip L2 cache).

The architectures may be pretty similar on paper, but the costs are in the
implementation. I don't know AMD's costs. I have a pretty good idea of
EV7 costs.

ES45 uses a chipset optimized for small systems and lower cost. It
doesn't scale past 4 CPUs. In this size system, large caches are very
effective in terms of both cost and performance.

But so are smaller, faster on-chip caches with much higher associativity
- plus radically lower main-memory latency.
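
To illustrate why associativity matters independently of size, here is a toy cache simulator. It is a sketch only: the 8-line cache, 64-byte lines, LRU replacement, and the two-address access pattern are all hypothetical choices, not a model of either the EV68 board cache or the EV7 on-chip cache.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, line_size=64):
    """Count misses in a set-associative cache with LRU replacement.

    ways=1 gives a direct-mapped cache; num_lines must be divisible by ways.
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        tag = line // num_sets
        cache_set = sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)   # evict least recently used
            cache_set[tag] = True
    return misses


# Two addresses exactly one cache-size apart collide in a direct-mapped cache.
pattern = [0, 8 * 64] * 10
print("direct-mapped misses:", count_misses(pattern, num_lines=8, ways=1))
print("2-way misses:        ", count_misses(pattern, num_lines=8, ways=2))
```

With this pattern the direct-mapped cache misses on every access because the two lines keep evicting each other, while the 2-way cache of the same total size misses only on the first touch of each line.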

I don't know how much the high EV7 associativity contributed to the cost
or development time of the CPU. I doubt it was negligible.


And fast off-chip SRAM and separate memory-controllers aren't that
cheap, either. I question (again) whether overall small-system board
costs would have been any lower using EV68 than EV7: in fact, I suspect
the opposite - if the components were priced reflecting HP's cost rather
than reflecting what HP thought it could get for them.

I'm not at liberty to share what I know about the costs in public, obviously.

Don't confuse prices, which HP can control, with costs, which HP can't
control. HP (and others) can and have fiddled with prices to push
customers toward the fad of the quarter. Costs are much less flexible.
Costs of single-source CPU chips, with a custom-designed process, are
hardly flexible at all.



GS320-class systems don't use the ES45 chipset, and they have much worse
memory performance at every system size.

GS320-class systems don't use EV68 CPUs, BTW. They use EV6 or EV67.

Mea culpa.


So for single-processor workloads which are 'cache-friendly' EV7 is only
about on a par with EV68 plus 16 MB of off-chip L2 (though even there by
virtue of eliminating a bunch of other board components EV7 is likely a
considerably more cost-effective design).

Yes, ES45 has a few memory and cache ASICs that add up to perhaps a couple
of hundred dollars of cost per system. But EV68 CPUs are around half the
cost of EV7.

Now, why do you suppose that is? The additional chip area consumed by
the additional EV7 features does roughly double the chip size IIRC, but
after packaging and testing are taken into account that hardly doubles
the overall part cost to produce.

The EV7 uses a completely different fabrication process than EV68. I know
it was difficult and time-consuming to get it working and get adequate
yields for EV7. The packaging for EV7 is more expensive as well. I don't
know the CPU vendor's costs and profits, but I know HP's cost for EV7
fairly well. They are not inexpensive CPUs.


And, of course, *board* complexity (in terms of additional parts count)
costs something as well.

Both system boards are very complex; there are lots of components, lots of
layers, and both run close to their respective limits of signal integrity.

And the cables and connectors that link together EV7 system
boxes are FAR more expensive than the ES45 chipset.

Except that when we're talking about a small-system configuration (as I
believe we are here) we don't *need* any of those.

Sort of. At 4 CPUs and above, the interconnects are a significant issue
for EV7 systems. At the 2 CPU level, you mostly avoid it.


We may be looking at these questions from two radically different
directions. I'm not particularly interested in the best, cheapest system
that could have been produced in a given architecture. These armchair
design exercises can be fun, partly because they can ignore inconvenient
facts to optimize the hypothetical.

I've been much more concerned with making successful PRODUCTS, in the real
world of Compaq/HP's business needs and target market. In that world,
there are real constraints on available engineering and management talent,
costs of parts, and market timing. Lots of things that "should" have been
done, weren't done because they weren't practical and/or timely.


Consider a few cases...

For a 1 CPU EV7 system, you have to implement single-CPU power and
packaging to replace the 2-CPU duo used on all existing EV7 systems. I'm
sure that could be done, but there's little to re-use from existing Alpha
system design. You'd end up with a system board, a CPU, some memory, and
an I/O subsystem. There's a good chance you have to make a new chassis
and a new power distribution subsystem. That would take a while. The CPU
would cost what EV7s cost. The RAMBUS memory that never quite caught on
would be expensive. The SRM firmware is unlike anything done before, and
costs quite a lot to develop and debug. The advantage would be capturing
a good fraction of the potential performance benefits that you and others
have noted.

Put that up against the DS15, which re-used the DS10 chassis, the XP1000
power supply, and much of the DS25/ES45 electrical design and firmware.
The CPU is much cheaper than EV7 (you'll have to trust me), but a few
additional ASICs (highly leveraged from earlier systems) are needed
compared to the EV7 solution. And we have to pay for the L2 cache, but at
least that's fairly standard technology that's on a downward price curve
over the life of the product. The DS15 came to market in about 1 year,
driven by an obvious gap in the product line and a clear need to minimize
the sales price of the system. Every feature that added significant risk
to the schedule, or added significant cost, was excluded. Performance was
MUCH less important than cost according to the marketing gurus. This
system was mainly for the VMS market; it was after the HP merger and Tru64
was already on the decline. The VMS market didn't want a 1-CPU system
that pushed for performance; it wanted a low-cost system, SOON.

Maybe in a different marketing situation, performance would have been more
important, and the extra cost of the EV7-based 1-CPU system would have
been justified. I did have the impression that a later generation (EV8
timeframe?) of small systems was planned. 1 CPU systems might not have
made it, but a dedicated 2-CPU design would have been very likely.

I don't know if I can give enough detail to convince you, but DS15 was
much less expensive to design and build than an EV7-based 1-CPU system
would have been at the time. Change the target market environment,
increase the expected system volume by a factor of 5 or 10, allow an extra
year to come to market, or otherwise alter reality in a major way, and
there's a chance the choice would have swung the other way.


How about the 2 CPU space?

Here we can compare two real systems, the DS25 and the ES47 Tower. Both
are still for sale, which tells me they both have willing customers, and
HP is making money on both of them. DS25 was an "easy" scaling-down of
the existing ES45, with extensive re-use from DS20E. For customers who
need more memory or IO than the DS15 can provide, the DS25 can be built
with only 1 CPU at lower cost than the 2-CPU standard. Time-to-market was
hampered by a couple of engineering problems that were unexpected and
unavoidable, so the system was late. It SHOULD have shipped earlier than
it did.

ES47T re-uses the ES80 building block in the smallest reasonable
configuration. The box incurs some large-system cost overhead
(scalability and interconnect stuff that isn't used), but this was deemed
a better solution than re-engineering a dedicated subsystem for the 2-CPU
space. There's no way to configure this system with just 1 CPU to save
money; that's another schedule vs. cost tradeoff.

ES47T enjoyed some early discounts intended to soup up EV7 volume, so the
end-user cost was attractive. I haven't priced out a system in a year or
more, so I don't know if the discounts are still there. (I'm on a slow
connection so I won't brave HP's web site to try it now.) But again, I
know DS25 is less expensive for HP to build than ES47T.



Alpha system volumes
were never big enough to get significant economies of scale from stuff
like ASICs, where the development and test costs are huge.

All the more reason why eliminating them likely saves considerably more
than the cost of the increased chip area required to incorporate them on
the chip.

The biggest problem is that the EV7 solution, as it came into being in the
real world, is painfully expensive. It dwarfs things like ASICs. But I
agree with you, if EV7 had been cheap enough, and if system volumes had
been high enough to warrant the redesign, and if there had been time to do
the work, an EV7 solution in the 1-2 CPU space might have been cheaper
than the ones HP actually designed.
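
The "if volumes had been high enough" condition is the usual amortization trade-off, which can be sketched as follows. Every dollar figure here is an invented placeholder (no real Compaq/HP costs are known or implied); the only point is the shape of the calculation, assuming the new design carries the higher one-time engineering cost.

```python
def unit_cost(nre, per_unit, volume):
    """Cost per system: recurring parts cost plus one-time engineering
    (NRE) cost spread over the number of systems built."""
    return per_unit + nre / volume


def break_even_volume(nre_reuse, unit_reuse, nre_new, unit_new):
    """Volume above which the new design is cheaper per system.

    Returns None if the new design does not save anything per unit,
    in which case no volume can pay back its extra NRE.
    """
    saving_per_unit = unit_reuse - unit_new
    if saving_per_unit <= 0:
        return None
    return max(nre_new - nre_reuse, 0) / saving_per_unit


# Hypothetical numbers: a re-use design with low NRE but more parts,
# versus a clean-sheet design with high NRE.
print(break_even_volume(1e6, 5000, 5e6, 4000))  # new parts are cheaper
print(break_even_volume(1e6, 4000, 5e6, 5000))  # new parts are dearer
```

The second call reflects the situation described above: if the integrated CPU itself costs more than the parts it eliminates, higher volume alone never swings the choice, and the CPU cost would have to come down first.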


It occurs to me that there's a path not taken (and not even considered by
the powers that has-been, AFAIK). Implement the EV7 design in a cheaper,
less aggressive process -- the EV68 process or something even blander.
Target a much lower CPU cost and give up the cutting-edge performance.
Take advantage of the large-system scaling by throwing more CPUs at the
workload across the whole product line from 2 CPUs and up. This idea
seems plausible in hindsight, knowing how painful the EV7 fabrication
process turned out to be, and how good the system architecture is. Would
it have succeeded? I don't know. I'm pretty sure it's too late to find
out now.