Re: VAX floating-point instruction timing?



On Wed, 6 Sep 2006, Hoff Hoffman wrote:

Not all clock cycles are the same.

Not all clock ticks are the same.

So what do you mean by each of the words "cycle" and "tick" -- just to make sure I don't misunderstand anything?

As far as I know, all the VAX implementations were synchronous so all clock ticks were equally long.

AFAIK, you're wrong. There were stretched clock cycles on at least one VAX box; there was a VAX with two different lengths for its clock cycle, depending on what is going on in the i-stream. (And I'm not talking about the lower-performing microcode, that is a whole different discussion and a whole different matter.)

There are also what amount to stretched (or runt, I've forgotten which) clock ticks on Alpha systems, too.

Are you talking about time borrowing? That is, where some pipe stage actually requires a bit more than a cycle and a neighbouring stage requires a bit less and they play a bit with the clock arrival times to make it work?

Or do you actually mean that the cycle length of a clock to the same pipe stage may differ from one cycle to the next?

If the latter, then I'm very sure you are wrong about the Alpha.

Most folks assume that the VAX systems are all similar, and from a

I know they aren't.

Some microcoded everything, some thankfully used traps and emulation in macrocode. Some had PDP-11 compatibility, others didn't. Some had vector instructions, most thankfully didn't. Some had virtualization support, some didn't. Many different buses were used. Some had a PDP-11 as a frontend processor, some in the form of J-11 single-chip CPU. Some trapped to "BIOS" code in ROM for the console stuff. Some used TTL chips, some used ECL macro arrays in various technologies, some used custom ECL chips (VAX 9000 went really crazy there). Some used custom CMOS on anywhere from a handful of chips to just one.

Some microcoded the operand decoding and the operation executing in the same state machine.

Others had a semi-independent state machine for instruction fetch, operand decode and operand fetch/store/address computation.

The number of internal registers available to the microcode varied.
How many ports they had varied.

Some had single-bit shifters, some had funnel shifters. Some had to implement floating-point on top of a single-bit shifter and an adder, others had dedicated chips with Booth multipliers using redundant representation.

Oh, and write buffers. Took DEC a real long time to put a decent number of write buffers into their machines, which seems odd given how many writes in a row you need to handle exceptions and CALLS/CALLG.

Etc.

I've never heard a VAX referred to as "simple," and a VAX microarchitecture is not "simple."

Compared to a modern CPU, they were.


In some ways, and not in others.

Actually, in all ways.

As far as building your own microcode "real" VAX (as I might infer the goal to be here), do have fun with that. The other obvious approach to building your own ("real") VAX is via an FPGA VAX, of course.

I know, but that wouldn't be quite so fun ;)

I don't expect it to take all that much space or require all that many chips. I can use SRAMs for the registers, for the register renamer, for flag generation (as a PLA replacement), for the microopstore (that keeps mostly linear sequences of microops for complicated operand specifiers, complicated instructions, boot, and the trap/exception/fault/interrupt/machine check stuff).
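To illustrate the "SRAM as PLA replacement" idea for flag generation, here is a minimal sketch, not the actual design. It assumes a 32-entry table addressed by five status bits from a 32-bit add (sign of each operand, sign of the result, carry out, result-is-zero) that returns the N, Z, V, C condition codes. The names build_add_flag_table and add32_flags, and the choice of address bits, are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def build_add_flag_table():
    """Contents of a hypothetical 32 x 4-bit SRAM for add-style flags.

    Address bits (MSB first): a_sign, b_sign, r_sign, carry_out, zero.
    Data bits (MSB first): N, Z, V, C.
    """
    table = []
    for addr in range(32):
        a_s = (addr >> 4) & 1
        b_s = (addr >> 3) & 1
        r_s = (addr >> 2) & 1
        carry = (addr >> 1) & 1
        zero = addr & 1
        n = r_s
        z = zero
        # Signed overflow: operands agree in sign, result disagrees.
        v = 1 if (a_s == b_s and r_s != a_s) else 0
        c = carry
        table.append((n << 3) | (z << 2) | (v << 1) | c)
    return table

ADD_FLAGS = build_add_flag_table()

def add32_flags(a, b):
    """32-bit add; the flags come from a table lookup, not from logic."""
    a &= MASK32
    b &= MASK32
    total = a + b
    result = total & MASK32
    addr = (((a >> 31) & 1) << 4) | (((b >> 31) & 1) << 3) \
        | ((result >> 31) << 2) | ((total >> 32) << 1) | int(result == 0)
    return result, ADD_FLAGS[addr]
```

Other instruction classes would need other tables (or more address bits selecting the class); the boot-time loader would write those contents into the SRAM.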

I expect to use an 8051 to write the right content into the SRAMs during boot, either from a set of (E?)EPROMs or downloaded from a PC.

(I am also writing a Verilog version of the microarchitecture, to make unit testing easier.)

There are certainly ways to greatly simplify the VAX implementations and the architectures -- that said, the installed base tended to preclude that sort of thing. If you're going to break compatibility (a little), you might as well break it (a lot).

I am not even /considering/ changing the architecture. I want it to be completely compatible. Yeah, I know, they weren't completely compatible with each other but they all implemented the architecture modulo various subsets and the inevitable bug that would creep in. It's going to look most like a CVAX:

o Console stuff in ROM (and I'm not going to copy the actual VAX monitors
  - as long as it can do character I/O then I'm happy)
o no virtualization support
o no vector support
o 30-bit physical address space
o no PDP-11 compatibility mode
o little to no multicpu support
o strong memory ordering
o cache coherence, both internally, and with I/O
o single-level TLBs, shared between code and data -- direct-mapped like
  the 730, 8800, uVAX I, and the VAX 9000.
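As a rough sketch of what a direct-mapped, shared TLB lookup amounts to: the VAX page size of 512 bytes is architectural, but the 64-entry size, the class name DirectMappedTLB and its methods are assumptions for illustration only. Protection bits, the modify bit and the split between process and system space are left out.

```python
PAGE_SHIFT = 9       # VAX pages are 512 bytes
TLB_ENTRIES = 64     # assumed size; must be a power of two

class DirectMappedTLB:
    """One shared TLB for code and data; each VPN maps to exactly one slot."""

    def __init__(self, entries=TLB_ENTRIES):
        self.entries = entries
        self.slots = [None] * entries   # each slot holds (vpn, pfn) or None

    def _split(self, vaddr):
        vpn = (vaddr & 0xFFFFFFFF) >> PAGE_SHIFT
        return vpn, vpn % self.entries  # full VPN is the tag, low bits index

    def lookup(self, vaddr):
        """Return the physical address, or None on a TLB miss."""
        vpn, index = self._split(vaddr)
        slot = self.slots[index]
        if slot is not None and slot[0] == vpn:
            offset = vaddr & ((1 << PAGE_SHIFT) - 1)
            return (slot[1] << PAGE_SHIFT) | offset
        return None

    def fill(self, vaddr, pfn):
        """Install a translation, evicting whatever occupied the slot."""
        vpn, index = self._split(vaddr)
        self.slots[index] = (vpn, pfn)
```

The price of direct mapping is visible in fill(): two pages whose VPNs share the same low bits evict each other, even if the rest of the TLB is empty.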

Multiplication/division and floating-point are probably going to be implemented last. And they are going to be slooow because my shifter is only going to handle one bit at a time -- but some of the VLSI implementations had the same problem.
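To show why a one-bit shifter makes this slow, here is a minimal sketch of a shift-and-add multiply that only ever shifts by one position per step. It models an unsigned 32x32 multiply; the function name and the step counter are only for illustration, and real microcode would also have to handle signs and overflow detection.

```python
MASK32 = 0xFFFFFFFF

def shift_add_multiply(a, b):
    """Unsigned 32x32 -> 64-bit product using only 1-bit shifts and adds.

    Returns (product, steps) where steps counts the shift iterations.
    """
    multiplicand = a & MASK32
    multiplier = b & MASK32
    product = 0
    steps = 0
    for _ in range(32):
        if multiplier & 1:          # low bit set: add the shifted multiplicand
            product += multiplicand
        multiplicand <<= 1          # one-bit left shift
        multiplier >>= 1            # one-bit right shift
        steps += 1
    return product, steps
```

Every multiply costs 32 iterations regardless of the operands, which is the kind of cost a funnel shifter or a Booth multiplier avoids.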

Exceptions should end up being lightning fast compared to the old implementations because I do not intend to have a "modification stack" that gets updated by autoincrement/autodecrement addressing opspecs and then has to be unwound by the exception-handling microcode. Instead, I intend to use register renaming in such a way that 1) every register modification goes to a new physical register and 2) there are sufficiently many more physical than architected registers that all the original register values are kept untouched by even the worst-case autoupdating instruction. When the exception comes, the register renamer can then "just" roll back to before the first register write.
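The rollback scheme can be sketched roughly as follows. This is a software model under assumptions, not the hardware design: the 16 architected registers are the VAX's R0-R15, but the 32 physical registers, the class name Renamer and its method names are invented for the example.

```python
class Renamer:
    """Toy register renamer: every write goes to a fresh physical register."""

    def __init__(self, n_arch=16, n_phys=32):
        self.n_phys = n_phys
        self.map = list(range(n_arch))            # architected -> physical
        self.free = list(range(n_arch, n_phys))   # unused physical registers
        self.phys = [0] * n_phys
        self.checkpoint = None

    def begin_instruction(self):
        # Remember the mapping as it was before the first register write.
        self.checkpoint = (list(self.map), list(self.free))

    def read(self, r):
        return self.phys[self.map[r]]

    def write(self, r, value):
        p = self.free.pop(0)      # never overwrite a live physical register
        self.phys[p] = value
        self.map[r] = p

    def commit(self):
        # Instruction completed: anything no longer mapped becomes free.
        self.free = [p for p in range(self.n_phys) if p not in self.map]
        self.checkpoint = None

    def rollback(self):
        # Exception: restore the old mapping; the old values were never touched.
        self.map, self.free = self.checkpoint
        self.checkpoint = None
```

For example, an instruction that autoincrements R3 twice and then faults can be undone with a single rollback() call, with no per-operand unwinding.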

The same register renamer should also allow a special lightning fast interrupt mode where the interrupt code gets to do whatever it likes with 5-8 fresh registers it doesn't have to save first. Such an interrupt mode would have been very handy for the system-on-a-chip implementations (for any low-cost VAX, actually).

It should also allow for any temporaries needed by the microops that an instruction breaks down to.

If I get this far, then I might consider adding:
o a bit in the page table so it is cheaper to distinguish between
  "read/executed", "dirty (written to)" and "not touched at all".
o a 4K page size. Perhaps combined with a traditional page table tree,
  perhaps not.
o a CPUID-like instruction with flags indicating what stuff is supported
  and what isn't. Having to look at model numbers and revision numbers
  for that isn't particularly clever.
o IEEE floating-point
o real sqrt, 1/x, 1/sqrt, sin/cos/tan instructions :)

I might (probably will) implement page table snooping so REI doesn't have to flush the TLBs. If I add caches, they will be virtually indexed and physically tagged and I will have to implement some snooping stuff that automatically invalidates "mirrored" cache lines so no byte of memory can be in more than one cache line at a time.
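To make the "mirrored" cache line problem concrete: with 512-byte VAX pages, any virtually indexed cache whose ways are larger than a page uses index bits above the page offset, so two virtual aliases of the same physical line can select different sets. The following is a small sketch that enumerates those sets; the function names and the cache geometries in the usage note are assumptions for illustration.

```python
PAGE_SIZE = 512  # bytes, VAX page size

def alias_sets(cache_bytes, ways, page_bytes=PAGE_SIZE):
    """Number of distinct sets in which one physical line could reside."""
    way_bytes = cache_bytes // ways
    return max(1, way_bytes // page_bytes)

def candidate_indices(paddr, cache_bytes, ways, line_bytes,
                      page_bytes=PAGE_SIZE):
    """All set indices that some virtual alias of `paddr` could select."""
    sets = cache_bytes // (ways * line_bytes)
    lines_per_page = page_bytes // line_bytes
    # Index bits inside the page offset are the same for every alias.
    offset_index = (paddr % page_bytes) // line_bytes
    if sets <= lines_per_page:
        return [offset_index % sets]
    # Index bits above the page offset depend on the virtual address.
    return [offset_index + k * lines_per_page
            for k in range(sets // lines_per_page)]
```

With an assumed 8 KB direct-mapped cache and 16-byte lines, candidate_indices() returns 16 sets that a fill would have to check and invalidate; with a way size of 512 bytes or less there is only one candidate and no aliasing to snoop for.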

Bootstrapping even OpenVMS VAX is only testing a very small part of the VAX architecture, for instance. Applications tended to depend on far more of the architecture, some more so than others.

What I would really like to get my grubby hands on is AXE, the architectural verifier they used at DEC :)

Thirty years ago, folks usually wrote assembler or microcode because they needed to, and they needed the extra performance for various tasks. (I was consulting for a place that was accepting gobs of data off communications links, and they were running the SBI and the VAX-11/780 processor and the comm-boards flat-out -- the speeds and feeds are nothing now, but were right at the SBI bandwidth back then. WCS was one of the few ways available toward additional performance, and was the choice at this site prior to the availability of the VAX-11/785. I expect they continued to use WCS after the inevitable series of hardware upgrades, too.)

Wasn't it a problem that the microcode was *hard* to port between generations? And that the programmers often implemented too much in microcode so the implementation took too long (so a new machine generation would arrive by the time they were done)?

There are still applications that are processor-limited, but there are relatively fewer of them -- various other considerations have moved to the forefront at many sites.

I'd say gcc often is processor limited ;)

-Peter