Chapter 3 Tomasulo Hardware

3.1 Overview

3.1.1 Gates, Circuits, Cost and Delay

The hardware model used in this thesis is presented in [MP95]. The following sections give only a brief overview.

Cost and delay are calculated with the methods and formulae presented in [MP95]. In particular, the overall calculation is done by transforming the complex formulae into a C program, which is discussed in chapter 5. Thus, cost and delay formulae are omitted in the following chapters.

Figure 3.1 lists the symbols of the basic gates used in the designs. In addition, the following basic circuits are used: n-bit adder / incrementer, n-bit multiplexer, tristate driver, n-bit register, n-bit decoder / encoder, n-bit zero tester, the generic parallel prefix circuit, RAM, shifter, and ALU. A detailed description and the cost and delay formulae can be found in [MP95].

Figure 3.1: Symbols of the basic gates

Furthermore, the hardwired control described in this chapter requires two additional basic circuits: the n-bit find first one circuit (FFO) and the find last one circuit (FLO). They calculate the following functions:

A recursive construction of the circuits and the cost and delay formulae are given in appendix A.1.
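The behaviour of the two circuits can be illustrated with a minimal Python sketch. It assumes that FFO marks the lowest index holding a one and FLO the highest, both in unary (one-hot) form; the FLO assumption agrees with its use in section 3.6.4. The function names are hypothetical, and the sketch models only the function, not the recursive circuit construction.

```python
def ffo(bits):
    """Find first one: one-hot list marking the lowest index i with bits[i] == 1."""
    out = [0] * len(bits)
    for i, b in enumerate(bits):
        if b:
            out[i] = 1
            break
    return out


def flo(bits):
    """Find last one: one-hot list marking the highest index i with bits[i] == 1."""
    # Reverse the input, reuse ffo, and reverse the result back.
    return ffo(bits[::-1])[::-1]
```

If no input bit is set, both functions return an all-zero list.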

3.1.2 The Pipeline Stages

In this chapter, the complete hardware of a DLX RISC core with Tomasulo scheduling is presented. Chapter 4 extends the design with an interface to main memory for load/store operations.

The design is based on the DLX implementations published in [MP95, MP98, Lei98]. It consists of a five-stage pipeline. The first stage (IF) performs the instruction fetch. In the second stage (D/I), the fetched instruction word is decoded and passed into an appropriate reservation station. The third stage (EX) contains the actual function units, which execute the instruction. Fast function units (i.e., units with a latency of one cycle) combine execution and dispatch in one cycle. For slow function units, the execute phase might take several cycles. In the fourth stage (completion), the result of the instruction is stored in the reorder buffer. The fifth stage (WB) performs the writeback of the result into the register file.

3.1.3 Environments

The CPU consists of environments. Figure 3.2 gives an overview of the data paths and the interconnection of their environments.

Figure 3.2: Overview of the data paths

The PC environment contains the PCs of stage 0 and 1 and performs PC calculations.
The instruction memory environment performs the actual instruction fetch and is the interface to the instruction memory or cache.
The IR environment contains the instruction register of the decode/issue stage (IR1).
The decode/issue environment decodes the fetched instructions and distributes the instructions among the function units. It also contains the main control automaton.
Each function unit, including the data memory environment, has its own set of reservation stations assigned to it. Each set has an independent control circuit. The reservation station environments belong to the decode/issue stage.
The function unit environments contain the function units, e.g., the ALU, the floating point units, and the data memory interface.
The CDB control environment allocates the CDB to the reservation stations.
The reorder buffer environment contains the reorder buffer and its control circuit. It also contains large parts of the interrupt handling circuitry and belongs to the completion stage.
The register file environment holds the register files and belongs to the writeback stage.

A detailed description of the individual environments follows.

3.2 The PC Environment

The PC environment (figure 3.3) contains the program counter PC. It is almost identical to the PC environment found in [Lei98]. The stage 0 PC register PC0 is used for the instruction fetch (section 3.3). After the instruction fetch, the value of this register is saved in oPC1. Furthermore, the PC environment calculates the new value of the PC register, depending on several control signals which are generated by the main control (section 3.5).

Figure 3.3: PC environment

Usually, the new value of the PC0 register is the old value incremented by four. In case of a branch, rfe, jump or an interrupt, the PC register has to be clocked with another value. In these cases, the setPC signal is set. The signal is calculated as follows:

setPC = JISRrfe ∨ op.branch ∨ op.jump ∨ op.jumpR

The signals op.branch, op.jump and op.jumpR are active in case of a branch/jump instruction and are calculated by the decode/issue environment. In case of a branch or jump instruction (jumpR=0), co1 (the immediate constant) is the target offset. In case of a jump register instruction (jumpR=1), the signal op1.l.data (low part of the first operand) is the new PC. The operands are provided by the decode/issue environment.

The JISRrfe signal is set iff the cause for the active setPC signal is an interrupt or an rfe instruction. Interrupts are indicated by the JISR signal, which is calculated in the ROB environment. In case of an interrupt, the address of the interrupt service routine (SISR) is clocked into the PC0 register.

The processing of rfe instructions spans two cycles. In the first cycle after an rfe instruction, the value of the EPC special purpose register is used as address for the instruction fetch. In this cycle, the DOrfe signal is active. This signal is provided by the decode/issue environment. In the second cycle after an rfe instruction, the value of the EPCn register is used. In this cycle, the rfeEPCn signal is active. This signal is the DOrfe signal delayed by one cycle with a register (figure 3.4). Thus, the JISRrfe signal is calculated as follows:

JISRrfe = JISR ∨ DOrfe ∨ rfeEPCn

Figure 3.4: rfeEPCn register

During an issue stall (issuestall=1), all clock enable signals are disabled in order to prevent modifications of the PC registers.
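The control signals of this section can be summarized in a short Python sketch. The two formulas are taken from the text. The single clock enable and the helper next_pc0 are simplifying assumptions; the choice of the target (branch offset, op1.l.data, SISR, EPC, or EPCn) is left to the caller and is not modeled here.

```python
def pc_control(jisr, do_rfe, rfe_epcn, branch, jump, jump_r, issuestall):
    """Return (JISRrfe, setPC, clock_enable) for one cycle; all arguments are 0 or 1."""
    jisr_rfe = jisr | do_rfe | rfe_epcn
    set_pc = jisr_rfe | branch | jump | jump_r
    # Assumption: one clock enable stands in for the clock enables of all
    # PC registers, which are disabled during an issue stall.
    clock_enable = 1 - issuestall
    return jisr_rfe, set_pc, clock_enable


def next_pc0(pc0, set_pc, target):
    """Sequential fetch adds four; otherwise the given target is clocked in."""
    return target if set_pc else (pc0 + 4) & 0xFFFFFFFF
```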

3.3 Instruction Memory Environment

The instruction memory environment (figure 3.5) performs the actual instruction fetch and is the interface to the instruction memory or first level instruction cache. It fetches the instruction word pointed to by the signal pc0, which is provided by the PC environment, and returns it as signal ir0. Ibusy and pff0 are signals generated by the instruction memory. Ibusy is active iff the instruction memory is temporarily unable to return the requested value, e.g., because of a cache miss. The pff0 signal is used to implement virtual memory and indicates a page fault. The instruction memory system only supports word aligned memory accesses. In case of a misaligned access, the imal0 signal is active. It is calculated as follows:

imal0 = pc0[0] ∨ pc0[1]

In case of an interrupt (JISR=1), in case of a misaligned instruction word, or if the instruction memory system is unable to return the requested instruction word, zero is returned instead. This is the opcode for a left shift of register R₀ by 0 bits; thus, it is a NOP instruction.
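The fetch behaviour described above can be sketched in Python as follows. The memory is modeled as a hypothetical dictionary from word-aligned byte addresses to instruction words; the page fault signal pff0 is not modeled.

```python
NOP = 0  # all-zero instruction word, which the text identifies as a NOP


def fetch(pc0, memory, jisr=0, ibusy=0):
    """Return (ir0, imal0) for the instruction word addressed by pc0."""
    imal0 = (pc0 & 1) | ((pc0 >> 1) & 1)   # imal0 = pc0[0] OR pc0[1]
    if jisr or imal0 or ibusy:
        return NOP, imal0                  # zero is returned instead of the word
    return memory[pc0], imal0
```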

Figure 3.5: Instruction memory environment

3.4 Instruction Register Environment

The instruction register (IR) environment (figure 3.6) contains the instruction register IR1, the page fault register PFF1, and the instruction misaligned register IMAL1 of the decode/issue stage. The instruction register holds the instruction word fetched by the instruction memory environment.

Figure 3.6: Instruction register environment

For some instructions, the instruction word provides an immediate constant. The instruction register environment contains the co1gen circuit (figure 3.7), which extracts the immediate constant from the instruction in IR1 and performs a sign extension to 32 bits (signal co1). The circuit is taken unchanged from [MP95]. The op.jjump and op.rtype signals are generated by the decode/issue environment and are used to determine the width of the constant and its position in the instruction word.

Figure 3.7: Immediate constant generation co1gen
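The extraction and sign extension performed by co1gen can be illustrated with a small Python sketch. It assumes the standard DLX layout, i.e., a 26-bit constant in IR[25:0] for J-type words and a 16-bit constant in IR[15:0] otherwise. It replaces the op.jjump and op.rtype inputs by a single hypothetical jtype flag and does not model the R-type case.

```python
def sign_extend(value, width):
    """Sign-extend the low `width` bits of value to a 32-bit word."""
    value &= (1 << width) - 1
    if value >> (width - 1):          # sign bit of the field is set
        value -= 1 << width
    return value & 0xFFFFFFFF


def co1gen(ir1, jtype):
    """Immediate constant co1 for instruction word ir1 (assumed field layout)."""
    return sign_extend(ir1, 26 if jtype else 16)
```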

3.5 Decode/Issue Environment

The decode/issue environment serves two purposes: it decodes the instruction word in IR1, and it distributes the instructions and their operands among the reservation stations. A DLX floating point instruction can have up to four source operands (two registers, IEEEf and the interrupt mask); therefore, four operand busses (op1 to op4) originate in the decode/issue environment.

3.5.1 Decoding the Instruction Word

The decoding of the instruction word is done by the opgen (operation generation) circuit. The circuit opgen generates several control signals from the instruction word found in IR1 with a control automaton. This environment is taken almost unchanged from [Lei98]; a figure is therefore omitted.

The automaton has two parts, ID1 and ID2. The first part, ID1, has an automaton state assigned to each instruction. Table 3.1 contains the states and the monomials which are used to compute the new state. The state is never stored in any register; it is only used to compute the active control signals. Table 3.1 also lists the control signals which are active in a given state.

The itype, rtype, and jtype control signals correspond to the instruction word formats of the same name (appendix C), iuFOP indicates an unimplemented floating point instruction, and ill indicates an illegal instruction word. The signals fp and db specify whether floating point (fp=1) or double precision (db=1) values are involved. The signals with names beginning with "FU." indicate the function unit which is required to process the instruction. All other signals specify the action to be performed by this function unit.

Depending on its state, ID1 generates further control signals, which are used to select the correct source and destination operands (table 3.2). For example, op1.RS1 indicates that the register addressed by RS1 is expected on operand bus one. RS1, FS1, RS2, FS2, RD, FD and SA correspond to bit fields in the instruction word, which contain the desired register address (appendix C). R31, FCC, RM and MASK are constant register addresses (table 3.15 contains the coding of the special purpose register addresses). The op2.imm signal is active iff the immediate constant in the instruction word is operand two.

The second part of the automaton, ID2, is only used by branches. In case of a conditional branch, it computes the bjtaken signal, which is active iff the branch is to be taken. If the instruction is not a branch, the signal is undefined. The ID2 automaton requires two input signals: AEQZ and FCCEQZ. AEQZ is active iff the source operand of a conditional branch is zero. FCCEQZ is active iff the FCC bit is zero. Both FCC and the beqz/bnez operand are on the low part of operand bus op1 (table 3.2). Thus, op1.l.data is tested:
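The ID2 monomials of table 3.1 amount to an exclusive-or of the zero test and IR1[26]. The following Python sketch illustrates this. It assumes that the zero test covers the whole 32-bit word on op1.l.data for both AEQZ and FCCEQZ; the original circuit may test only a single bit for FCC. The function name is hypothetical.

```python
def bjtaken(ir1, op1_l_data):
    """Branch decision for a conditional branch instruction word ir1."""
    ir26 = (ir1 >> 26) & 1                        # 0: branch if zero, 1: branch if not zero
    eqz = int((op1_l_data & 0xFFFFFFFF) == 0)     # stands for AEQZ or FCCEQZ (assumption)
    # eqz AND NOT ir26, or NOT eqz AND ir26
    return eqz ^ ir26
```

For instructions other than conditional branches the returned value is meaningless, in line with the text.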

                                                  Monomials
Target state  Active control signals              IR[31:26]  IR[6]  IR[5:0]

ID1:
ALU           rtype, FU.alu                       000000     *      0001**
                                                  000000     *      10****
Shifti        rtype, FU.alu                       000000     *      0000**
ALUi          itype, FU.alu                       0*1***     *      ******
Load          itype, load, FU.mem                 100***     *      ******
Load.s        itype, load, fp, FU.mem             110001     *      ******
Load.d        itype, load, fp, db, FU.mem         110101     *      ******
Store         itype, store, FU.mem                101***     *      ******
Store.s       itype, store, fp, FU.mem            111001     *      ******
Store.d       itype, store, fp, db, FU.mem        111101     *      ******
Faddsub.s     rtype, faddsub, fp, FU.fadd         010001     0      00000*
Faddsub.d     rtype, faddsub, fp, db, FU.fadd     010001     1      00000*
Fmul.s        rtype, fmul, fp, FU.fmul            010001     0      000010
Fmul.d        rtype, fmul, fp, db, FU.fmul        010001     1      000010
Fdiv.s        rtype, fdiv, fp, FU.fdiv            010001     0      000011
Fdiv.d        rtype, fdiv, fp, db, FU.fdiv        010001     1      000011
Fcond.s       rtype, fcc, fp, FU.ftest            010001     0      11****
Fcond.d       rtype, fcc, fp, FU.ftest            010001     1      11****
Fabsneg.s     rtype, fabsneg, fp, FU.fconv        010001     0      00010*
Fabsneg.d     rtype, fabsneg, fp, db, FU.fconv    010001     1      00010*
Ff2i          rtype, ff2i, fp, FU.fconv           010001     *      001001
Fi2f          rtype, fi2f, fp, FU.fconv           010001     *      001010
FMov.s        rtype, fmov, fp, FU.fconv           010001     0      001000
FMov.d        rtype, fmov, fp, db, FU.fconv       010001     1      001000
FConv.s       rtype, fconv, fp, FU.fconv          010001     *      100*00
FConv.d       rtype, fconv, fp, db, FU.fconv      010001     *      100001
Branch        itype, bjjr, branch, noFU           00010*     *      ******
FBranch       itype, bjjr, branch, fp, noFU       00011*     *      ******
JumpReg       itype, bjjr, bjtaken, jumpR, noFU   010110     *      ******
Jump&LinkReg  itype, jalr, bjtaken, jumpR, noFU   010111     *      ******
Jump          jtype, bjjr, bjtaken, jump, noFU    000010     *      ******
Jump&Link     jtype, jalr, bjtaken, jump, noFU    000011     *      ******
Trap          jtype, trap, noFU                   111110     *      000000
RFE           jtype, rfe, noFU                    111111     *      ******
Movs2i        rtype, movs2i, FU.alu               000000     *      010000
Movi2s        rtype, movi2s, FU.alu               000000     *      010001
FUnimp        iuFOP, noFU                         010001     *      00011*
                                                  010001     *      01****
Illegal (z₀)  ill, noFU                           -

ID2:
Taken         bjtaken                             AEQZ · /IR1[26]
                                                  /AEQZ · IR1[26]
                                                  FCCEQZ · /IR1[26]
                                                  /FCCEQZ · IR1[26]
Untaken                                           /taken

Table 3.1: States, active control signals and DNFs
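How the ID1 monomials select a state can be illustrated with a short Python sketch. It contains only a few rows of table 3.1, so it is not a complete decoder. The list ROWS, the helper names, and the choice of returning the first matching row are assumptions made for this illustration.

```python
# A few rows of table 3.1: (state, IR[31:26], IR[6], IR[5:0]); '*' is a don't-care.
ROWS = [
    ("ALU",       "000000", "*", "0001**"),
    ("ALU",       "000000", "*", "10****"),
    ("Shifti",    "000000", "*", "0000**"),
    ("ALUi",      "0*1***", "*", "******"),
    ("Load",      "100***", "*", "******"),
    ("Faddsub.d", "010001", "1", "00000*"),
    ("Jump",      "000010", "*", "******"),
]


def matches(pattern, value):
    """True if bit string `value` agrees with `pattern` on all non-'*' positions."""
    return all(p in ("*", v) for p, v in zip(pattern, value))


def id1_state(ir):
    """Return the state of the first matching row, or 'Illegal' if none matches."""
    bits = format(ir & 0xFFFFFFFF, "032b")       # bits[0] is IR[31]
    opcode, ir6, low = bits[0:6], bits[25], bits[26:32]
    for state, p_op, p6, p_low in ROWS:
        if matches(p_op, opcode) and matches(p6, ir6) and matches(p_low, low):
            return state
    return "Illegal"
```

With these rows, the all-zero instruction word decodes to Shifti, which agrees with the NOP encoding mentioned in section 3.3.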

State         Instructions                        op1.  op2.  op3.  op4.  dest.

ALU           add, sub, test/set, shift           RS1   RS2   -     -     RD
ALUi          addi, subi, test/set immediate      RS1   imm   -     -     RD
Shifti        shift with shift amount             RS1   imm   -     -     RD
Load          load GPR                            RS1   -     -     -     RD
Load.s        load single precision FPR           RS1   -     -     -     FD
Load.d        load double precision FPR           RS1   -     -     -     FD
Store         store GPR                           RS1   RD    -     -     -
Store.s       store single precision FPR          RS1   FD    -     -     -
Store.d       store double precision FPR          RS1   FD    -     -     -
Faddsub.s     fadd.s, fsub.s                      FS1   FS2   RM    MASK  FD
Faddsub.d     fadd.d, fsub.d                      FS1   FS2   RM    MASK  FD
Fmul.s        fmul.s                              FS1   FS2   RM    MASK  FD
Fmul.d        fmul.d                              FS1   FS2   RM    MASK  FD
Fdiv.s        fdiv.s                              FS1   FS2   RM    MASK  FD
Fdiv.d        fdiv.d                              FS1   FS2   RM    MASK  FD
Fcond.s       fc.cond.s                           FS1   FS2   -     MASK  FCC
Fcond.d       fc.cond.d                           FS1   FS2   -     MASK  FCC
Fabsneg.s     fabs.s, fneg.s                      FS1   -     -     -     FD
Fabsneg.d     fabs.d, fneg.d                      FS1   -     -     -     FD
Ff2i          mf2i                                FS1   -     -     -     RS2
Fi2f          mi2f                                RS2   -     -     -     FS1
FMov.s        mov.s                               FS1   -     -     -     FD
FMov.d        mov.d                               FS1   -     -     -     FD
FConv.s       cvt.s.d, cvt.s.i, cvt.i.s, cvt.i.d  FS1   -     -     -     FD
FConv.d       cvt.d.i, cvt.d.s                    FS1   -     -     -     FD
Branch        beqz, bnez                          RS1   -     -     -     -
FBranch       fbeqz, fbnez                        FCC   -     -     -     -
JumpReg       jr                                  RS1   -     -     -     -
Jump&LinkReg  jalr                                RS1   -     -     -     R31
Jump          j                                   -     -     -     -     -
Jump&Link     jal                                 -     -     -     -     R31
Trap          trap                                -     -     -     -     -
RFE           rfe                                 -     -     -     -     -
Movs2i        movs2i                              SA    -     -     -     RD
Movi2s        movi2s                              RS1   -     -     -     SA

Table 3.2: Operands and bus use

3.5.2 Function Unit Availability Test

As mentioned above, the control automaton ID1 determines which function unit is required to process the instruction in IR1. Table 3.3 lists all function units with their purpose and the control signals used to identify them. For each function unit, a single signal FU[i] is defined in order to simplify notation.

The FUtest circuit of figure 3.8 tests whether the required function unit is available. This is done as follows (n denotes the number of function units):

FUbusy = ⋁_{i=0}^{n-1} (FU_i.full ∧ FU[i])

The FU_i.full signals are generated by the reservation station controls of the function units. FU_i.full is active iff the reservation stations of the corresponding function unit are not able to accept an instruction.

The signal set D.FU_i.issue specifies the function unit which is actually used for issue. These signals are disabled in case of an issue stall, which is indicated by the issuestall signal.
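The availability test can be sketched in Python as follows. The FUbusy formula is taken from the text. The computation of the issue signals is an assumption: the sketch simply gates FU[i] with the absence of an issue stall.

```python
def fu_test(fu_required, fu_full, issuestall):
    """fu_required[i] models FU[i], fu_full[i] models FU_i.full.

    Returns (FUbusy, list of D.FU_i.issue values).
    """
    fubusy = int(any(f and r for f, r in zip(fu_full, fu_required)))
    # Assumption: D.FU_i.issue is FU[i], disabled during an issue stall.
    issue = [int(r and not issuestall) for r in fu_required]
    return fubusy, issue
```

In the full design, issuestall itself depends on FUbusy (section 3.5.6); here it is passed in as an independent input for simplicity.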

FU                 Purpose

FU[0] = FU.alu     integer instructions, movi2s, movs2i
FU[1] = FU.mem     load, store
FU[2] = FU.fadd    floating point addition and subtraction
FU[3] = FU.fmul    floating point multiplication
FU[4] = FU.fdiv    floating point division
FU[5] = FU.fconv   conversion floating point / integer
FU[6] = FU.ftest   floating point condition tests

Table 3.3: Coding of the function units

Figure 3.8: Function unit availability test FUtest

3.5.3 Operand Address Generation Agen

The decode/issue environment also provides the operands of the instruction. For the source operands, the values are provided if they are available. Otherwise, the decode/issue environment provides the appropriate instruction tag to the reservation stations. This is done by the Agen and datagen circuits. For the destination operand, the type and address are determined. This is done by the destgen circuit. Figure 3.9 gives an overview of these circuits.

Figure 3.9: Generation of the operands

The operand address generation circuit Agen (figure 3.10) calculates the types and addresses of the source registers. For each operand, the operation generation environment opgen provides signals, which point to bit fields in the instruction word. In turn, these bit fields contain the register addresses of the operands. The type of an operand is represented by five signals:

The signals opi.fpr, opi.gpr, opi.spr denote the register file which holds the operand, i.e., the floating point, general purpose, and special purpose register file.
The signal opi.db indicates a double precision floating point register.
The signal opi.imm is set iff the operand is the immediate constant.

The number of different operand types is limited for certain operand busses (table 3.2). Operand bus op1 is used for the registers pointed to by RS1 / FS1 (they share the same bit field), RS2 (for mi2f) and SA (for movs2i). Furthermore, it is supposed to provide the value of the FCC special purpose register to process the fbeqz and fbnez instructions. The immediate constant is never on operand bus op1.

The second operand bus op2 has to provide the registers pointed to by RS2 / RD / FS2 / FD (they share the same bit field). Furthermore, it is supposed to provide the immediate constant for ALU operations.

The operand bus op3 is only used for the rounding mode RM, which is required for many floating point instructions. Operand bus op4 is only used for the interrupt mask, which is also required for floating point instructions. For op3 and op4, no Agen circuit is necessary, since they are always used for the same register.

Figure 3.10: Operand address generation Agen

3.5.4 Operand Data Generation datagen

The operand data generation circuits generate the source operands from the addresses and types provided by the Agen circuit. These operands are distributed by four global data paths (table 3.4). The operand busses transport the operands to the reservation stations. Each operand (op1.l, op1.h, op2.l, op2.h, op3.l, op4.l) consists of three components, which are the tag, the valid bit and the operand data (table 3.5); J denotes the tag width in bits (section 3.9). Each operand bus has a datagen environment of its own. The environments for op1 and op2 are identical (figure 3.11). Operands three and four do not have a high part and are only used for two fixed special purpose registers. Thus, they have a special datagen environment (figure 3.12) in order to save hardware cost.

The operand data generation environment datagen for op1 and op2 generates one operand according to the signals generated by the operand address generation environment Agen. The low and the high part of the operand are calculated separately, since each part might come from a different source. For each operand and for each part, one of the following cases applies: it is the immediate constant, it is in the register file, it is a result currently on the CDB, it is in the ROB, or it is the result of an instruction which has not yet completed (figure 2.5). Thus, four cascaded multiplexers are used to select the data from the appropriate source.

Bus   Item   Width    Purpose

op1   l      J+32+1   low part of the first operand
      h      J+32+1   high part of the first operand
      high   1        lowest bit of the register address
op2   l      J+32+1   low part of the second operand
      h      J+32+1   high part of the second operand
      high   1        lowest bit of the register address
op3   l      J+32+1   third operand (always integer)
op4   l      J+32+1   fourth operand (always integer)

Table 3.4: Components of the global data paths

Item    Width   Purpose

tag     J       ROB tag of the instruction producing the operand
valid   1       valid = 1 ⇔ operand contains valid data
data    32      actual operand data

Table 3.5: Components of an operand

Operand is the Immediate Constant

The first step is checking whether the operand is the immediate constant (opi.imm=1) or not. If so, the low part of the operand is returned as follows:

In this case, the operand is valid already during issue. The data value is generated by the co1gen circuit (section 3.4). The high part of the operand can never be the immediate constant, thus, it is assumed to be zero in this case to have a defined value on the bus. The high part is also set to zero if the operand is not a double precision floating point value.

Operand is in the Register File

If the operand is not the immediate constant, it must be a register. Thus, the second step to get the operand is looking up its valid bit in the producer table. If the valid bit is set, the operand is in the register file. The operand address generation environment provides the necessary address signals opi.A, opi.fpr, opi.gpr, and opi.spr to the register files and to the producer tables, which return the requested values as opi.l/h.RF.Dout (register file) and opi.l/h.Prod.Dout (producer table), i ∈ {1, 2}. The registers RM and SR for operand three and four are directly provided by the SPR environment as SPR.RM and SPR.SR.

If the operand part is in the register file (opi.l/h.Prod.Dout.valid=1), the operand bus is set to the following values:

Operand is on the CDB

Otherwise (opi.l/h.Prod.Dout.valid=0), the producer table contains the tag of the instruction which produces the desired value. Since this value might be on the CDB in the current cycle, the tag retrieved from the producer table is compared with the tag on the CDB. If both tags are equal and if the valid bit of the CDB is active, the operand is forwarded from the CDB:

Operand is in the Reorder Buffer

The operand might also be in the reorder buffer. The tag found in the producer table is already the proper index for the ROB to check whether the result is already in the ROB. If so, the valid bit of the ROB entry is set. For this task, ports one to six of the ROB are used. Ports one and two are for op1.l and op1.h, ports three and four are for op2.l and op2.h, and ports five and six are for op3 and op4.

Operand is a Result of an Uncompleted Instruction

If none of the cases above applies, the operand must be a result of an uncompleted instruction. The tag of this instruction can be found in the producer table. In this case, the operand is not yet valid and the tag is turned over to the reservation station. The data signal is set to zero in order to have a defined value on the bus.
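The cascade of cases described in this section can be illustrated with a Python sketch for one operand part. All interfaces (the tuples Operand and Prod, the dictionary standing in for the ROB read port) are hypothetical. The tag value driven for already valid operands is an assumption, since a valid operand no longer needs its tag.

```python
from collections import namedtuple

Operand = namedtuple("Operand", "tag valid data")
Prod = namedtuple("Prod", "tag valid")      # producer table entry


def datagen(is_imm, co1, prod, rf_data, cdb, rob):
    """Select one operand part from its possible sources, in the order of the text."""
    if is_imm:                                   # immediate constant
        return Operand(0, 1, co1)
    if prod.valid:                               # value is in the register file
        return Operand(0, 1, rf_data)
    if cdb.valid and cdb.tag == prod.tag:        # result is on the CDB in this cycle
        return Operand(prod.tag, 1, cdb.data)
    entry = rob[prod.tag]                        # the tag indexes the ROB directly
    if entry.valid:                              # result is already in the ROB
        return Operand(prod.tag, 1, entry.data)
    return Operand(prod.tag, 0, 0)               # not yet computed: pass the tag on
```

The chain of if statements plays the role of the cascaded multiplexers: an earlier case takes priority over all later ones.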

Figure 3.11: Operand data generation datagen for op1 and op2

Figure 3.12: Operand data generation datagen for op3 and op4

3.5.5 Destination Operand Generation destgen

The destination operand generation environment destgen calculates the type and the address of the destination register. This circuit is similar to the address generation environment, which performs the same task for the source operands. The register type of the destination is determined according to table 3.2 as:

The destination is a double precision floating point value if the op.db signal is active or if it is a cvt.s.d or cvt.i.d instruction. These instructions can be distinguished from the other cvt instructions by IR[6] (appendix C).

The destination register address is extracted from bit fields of the instruction word (appendix C). The positions of these bit fields depend on the instruction word layout, which is specified by the itype and rtype signals.

3.5.6 Stall Generation stallgen

Issue stalls occur if one or more of the following conditions hold:

For the given instruction, all appropriate reservation stations are busy (FUbusy is active, section 3.5.2).
The instruction has to be stored in the reorder buffer, but the reorder buffer is full (ROB.full is active, section 3.9).
If the instruction is a movs2i and the source register is IEEEf, an issue stall is performed until the ROB is empty, to ensure that the register file contains the correct value. This is necessary, since floating point instructions modify the IEEEf special purpose register without any note in the producer table. The signal IEEEfstall is active under this condition:

Alternatively, a check for floating point instructions in the reorder buffer would be sufficient and could increase IPC rates at higher hardware cost.
The instruction is a conditional branch or a jump register instruction and the source operand op1 is not yet available. Issuing these instructions would require speculative execution, which is part of a thesis by Mark A. Hillebrand [Hil99]. The signal bstall indicates this stall condition:
If the instruction is an rfe instruction, an issue stall is required until the ROB is empty. This ensures that the ESR, EPC, and EPCn registers contain the correct values, since they might be modified by an instruction or interrupt prior to the rfe instruction. This condition is indicated by the signal rfestall:

In the cycle after the stall, DOrfe is activated. This signal causes the actual register transfers, which are done in the PC environment and in the register file environments.

DOrfe = op.rfe ∧ ROB.empty
The instruction fetch and issue stages have to be stalled if the instruction memory system is busy (IBusy is active) in order to prevent the destruction of the PC registers.

Furthermore, interrupts overrule any issue stall condition, since the instruction which causes the interrupt is always ahead of the instruction which causes the issue stall. Thus, the issuestall signal is generated as:
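A possible combination of the stall conditions is sketched below in Python. The DOrfe formula is taken from the text. The issuestall expression is an assumption derived from the list of conditions: the disjunction of all individual stall signals, masked by JISR.

```python
def issuestall(fubusy, rob_full, ieeef_stall, bstall, rfestall, ibusy, jisr):
    """Assumed combination: any stall condition holds and no interrupt overrules it."""
    any_stall = fubusy | rob_full | ieeef_stall | bstall | rfestall | ibusy
    return any_stall & (1 - jisr)


def do_rfe(op_rfe, rob_empty):
    """DOrfe = op.rfe AND ROB.empty, as given in the text."""
    return op_rfe & rob_empty
```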

In case of an issue stall, the following actions are performed:

The instruction fetch is stalled. This is done by disabling the clock enable signals of PC0, PC1, and IR1 (section 3.2).
All D.FU_i.issue signals are disabled (section 3.5.2) in order to prevent the instruction from being written into a reservation station.
The instruction is not stored in the reorder buffer (section 3.9).
The producer table is not modified (section 3.5.5).

3.6 The Reservation Station Environments

3.6.1 Overview

Each function unit has its own set of reservation stations assigned to it. Figure 3.13 gives an overview of a function unit with reservation stations and the producer. The dashed paths and circuits are extensions only needed for floating point function units.

Figure 3.13: A complete function unit with reservation stations

The reservation stations form a queue for instructions and their operands which are provided on the op1 to op4 busses. These busses originate in the decode/issue environment. In each cycle, any desired instruction can move from its reservation station into the function unit. For this purpose, all reservation stations are connected to a bus with tristate drivers. The bus and the reservation stations are controlled by the reservation station control.

If the function unit is a floating point unit, the data on this bus is adjusted in the single-adjust-one circuit. This circuit makes sure that single precision values are in the lower 32 bits of the bus. After leaving the function unit, the single-adjust-two circuit makes sure that single precision values are on both lower and higher 32 bits of the bus. After that, the result is propagated on the CDB by the producer circuit.

Integer function units do not need the single adjust circuits. The instruction is passed unmodified to the function unit. After leaving the function unit, the result is passed unmodified to the producer.

3.6.2 Operation of the Reservation Stations

As mentioned above, the reservation stations of a function unit j form a queue for the instructions and their operands. Let the queue have n_j reservation stations. The design in this thesis allows any number of reservation stations. The choice of n_j only depends on cost effectiveness. Chapter 5 contains a comparison of different assignments.

New instructions are always issued in-order into the first reservation station (reservation station 0). The input values for reservation station 0 are generated by the decode/issue environment.

For each operand of an instruction, a valid bit and tag bits are stored in the reservation station. The valid bit is set iff the operand is already in the data item of the reservation station. If it is not set, the tag bits hold the tag of the instruction which generates the operand, and the operand circuit snoops on the CDB for the missing operand: it compares the tag on the CDB to the tag stored in its register. If both are equal and if the valid bit of the CDB is active, the data item of the CDB is clocked into the data item of the operand. An instruction in a reservation station is said to be valid if all its operands are available, i.e., valid.

As soon as one or more instructions in the queue become valid, the oldest among these instructions is dispatched into the function unit and removed from the queue. The reservation station control calculates the necessary output enable signals.

In each cycle, an instruction in reservation station i moves into reservation station i+1, unless reservation station i+1 is full and cannot be freed by moving its content into reservation station i+2 or by dispatching the instruction into the function unit. The reservation station control calculates the necessary clock enable signals.
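The movement of the queue within one cycle can be illustrated with a Python sketch. The list representation, with None for an empty reservation station, and the function name are assumptions; the issue of a new instruction into reservation station 0 is not modeled.

```python
def advance(queue, dispatched=None):
    """One cycle of the queue; queue[i] is the entry in RS i, or None if RS i is empty."""
    q = list(queue)
    if dispatched is not None:
        q[dispatched] = None                 # the instruction leaves for the function unit
    # Walk from the tail so that a station freed in this cycle is refilled at once.
    for i in range(len(q) - 2, -1, -1):
        if q[i] is not None and q[i + 1] is None:
            q[i + 1], q[i] = q[i], None      # RS i moves into RS i+1
    return q
```

Each entry moves by at most one station per cycle, and a full queue without dispatch stays unchanged.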

3.6.3 Implementation of the Reservation Stations

Each reservation station can hold the operation code and the operands of one instruction. An implementation of an integer reservation station is given in figure 3.14. The reservation station has a register for the full bit, the tag bits and an operation code op. The full bit indicates that the reservation station is in use. The tag data item is the ROB tag of the instruction in the reservation station. The coding of the op data item depends on the interface to the function unit.

The values in reservation station i are updated if the RS_i.fill signal is active. The new values for the reservation station are selected depending on the RS_i.clear signal. If it is active, the reservation station is filled with an empty entry. Otherwise, the content of the previous reservation station RS_{i-1} is copied. The reservation station is also cleared in case of an interrupt, as indicated by the JISR signal. Thus, the content of RS_i is calculated as follows:

The clear signal only affects the op, tag, and full bits, which are set to zero by a multiplexer. The other registers of the reservation station are not cleared in order to save hardware cost.

Integer function units require two 32-bit wide operands. Each operand has its own box (figure 3.15). Each operand has three components, which are the valid, tag, and data component. The valid bit is set iff the operand is already in the reservation station, i.e., in the data component of the operand register. Otherwise, the tag bits contain the ROB tag of the instruction which produces the operand. Reservation station operands are updated in two ways: The first way is to copy the content of the same operand in the previous reservation station. This is done iff the fill signal is active. The second way is to copy the content of the corresponding components of the CDB. This is done if the readCDB signal is active, which is calculated as follows:

If readCDB is active, the reservation station operand provides the new value (i.e., the value on the CDB) as output to the next reservation station and to the function unit. The forwarding of the CDB data is essential for the following reasons: The operand is only one cycle on the CDB. If the data in a reservation station moves into the next reservation station, the operand on the CDB must be written into the next reservation station. Table 3.6 lists how the content of a reservation station operand is calculated.

fill   RS_i.opx.readCDB   RS_{i-1}.opx.readCDB   new value of RS_i.opx.data

0      0                  *                      RS_i.opx.data
0      1                  *                      CDB.data
1      *                  0                      RS_{i-1}.opx.data
1      *                  1                      CDB.data

Table 3.6: Calculation of the new value of a reservation station operand
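Table 3.6 can be written as a small Python function. The function next_data follows the table directly. The function read_cdb is an assumption based on the description in section 3.6.2: the operand is still missing, the CDB is valid, and the tags are equal.

```python
def read_cdb(op_valid, op_tag, cdb_valid, cdb_tag):
    """Assumed readCDB: operand not yet valid and its tag is on a valid CDB."""
    return int((not op_valid) and cdb_valid and op_tag == cdb_tag)


def next_data(fill, own_read, prev_read, own_data, prev_data, cdb_data):
    """New value of RS_i.opx.data according to table 3.6."""
    if fill:
        # Take over the operand of RS_{i-1}, or the CDB value it is receiving.
        return cdb_data if prev_read else prev_data
    # Keep the own operand, or take the CDB value it is receiving.
    return cdb_data if own_read else own_data
```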

Furthermore, the valid signal of the reservation station operand becomes active in the same cycle in which readCDB is active. This allows dispatching instructions in the same cycle they received their operands via the CDB. This is a performance optimization only and does not affect correctness.

Floating point function units require six operands: two 64-bit operands (each split into a low and a high part), the rounding mode rm, and the interrupt mask mask. The interrupt mask is needed by the rounder, since the result of an IEEE floating point operation depends on the interrupt mask [Ins85, EP97]. The implementation of the floating point reservation station (figure 3.16) is identical to the implementation of an integer reservation station except for the additional operands. The operand circuits are identical to those of an integer reservation station.

Figure 3.14: Reservation station for integer function units

Figure 3.15: Reservation station operand

Figure 3.16: Reservation station for floating point function units

3.6.4 Reservation Station Control

Dispatch Control

The reservation station control (figure 3.17) autonomously governs the dispatch of the valid instructions of the reservation stations into the function unit. Let RS₀,...,RS_{n_j-1} be the reservation stations of function unit FU_j.

The RS_i.doe signal is set iff the instruction in reservation station i is dispatched into the function unit. This transfer is done by a special bus. Each reservation station can write on this bus. RS_i.doe is the output enable signal of the bus driver of reservation station i.

The correctness proof (chapter 6) relies on choosing the oldest among the valid instructions. Since new instructions are always placed in reservation station 0 and only move towards higher indices, the oldest valid instruction is in the reservation station with the highest index among the valid reservation stations. Let RS_a be the reservation station which is to be dispatched.

a = max{ i ∈ {0,...,n_j-1} | RS_i.valid=1 }

The RS_i.valid signals are provided by the reservation stations. The max is calculated by a find last one (FLO) circuit (appendix A.1). The circuit returns a in unary representation. Let A_i denote this output. The dispatch has to be stalled if the function unit is stalled as indicated by the FUstall signal:

At most one of the RS_i.doe signals may be set in order to prevent bus contention. This is ensured by the find last one circuit.
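The dispatch selection can be sketched in C as follows. The sketch models only the function of the find last one circuit, not its recursive construction, and it assumes that RS_i.doe is the unary output A_i gated with the negated FUstall signal. The names `find_last_one` and `dispatch_enable` and the packing of the signals into a 32-bit word are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Find last one: unary (one-hot) code of the highest set bit of v,
 * or 0 if v is zero. */
uint32_t find_last_one(uint32_t v)
{
    uint32_t hot = 0;
    for (uint32_t bit = 1; bit != 0; bit <<= 1)
        if (v & bit)
            hot = bit;   /* a higher set bit overrides a lower one */
    return hot;
}

/* Bit i of valid is RS_i.valid; bit i of the result is RS_i.doe. */
uint32_t dispatch_enable(uint32_t valid, bool fu_stall)
{
    return fu_stall ? 0u : find_last_one(valid);
}
```

Because the result is either zero or one-hot, at most one bus driver is enabled in any cycle.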

Flow Control Signals

The reservation station control generates two flow control signals: RSvalid and FU_j.full.

The RSvalid signal is active iff data is dispatched into the function unit. This is true iff there is at least one valid reservation station and the function unit is not stalled. Thus:

The find last one circuit (appendix A.1) has a built-in zero tester, so that the signal can be generated with a single NOR gate.

The FU_j.full signal is active iff the reservation stations of function unit j are not able to accept an instruction. However, even if all reservation stations of a function unit are full, an instruction can be issued into a reservation station by dispatching one instruction into the function unit if the function unit itself is not stalled.

In case of an active FU_j.full signal, the decode/issue environment does not generate an FU_j.issue signal for the function unit.
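A minimal C sketch of the two flow control signals is given below. It assumes that RSvalid is the NOR of the zero output of the find last one circuit and FUstall, and that FU_j.full is active iff all reservation stations are full and none of them is dispatched in the current cycle. Both formulae are a reading of the description above, and the names `flow_signals` and `flow_control` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool rs_valid; /* RSvalid   */
    bool fu_full;  /* FU_j.full */
} flow_signals;

/* Bit i of valid/full is RS_i.valid/RS_i.full; n is the number of
 * reservation stations of the function unit (1..32). */
flow_signals flow_control(uint32_t valid, uint32_t full, unsigned n,
                          bool fu_stall)
{
    uint32_t all = (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    bool zero = (valid & all) == 0;   /* zero tester of the FLO circuit */
    flow_signals s;
    s.rs_valid = !(zero || fu_stall); /* a single NOR gate */
    /* assumed: no free entry now and none becomes free by dispatch */
    s.fu_full = ((full & all) == all) && !s.rs_valid;
    return s;
}
```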

Queue Control Signals

The reservation station control also computes the RS_i.fill and RS_i.clear signals. As described above, RS_i.fill is active iff entry i is to be filled with new values. The RS_i.clear signal controls whether to clear the reservation station or to copy the data of its predecessor RS_i-1. In case of the first reservation station RS₀, the data of the predecessor is the instruction provided by the decode/issue environment. The clear signal of a reservation station is only used in the following cases:

It is used if the entry of the previous reservation station is dispatched and therefore leaves the reservation station queue.
It is used in case of the first reservation station, if no instruction is issued into the first reservation station.

The calculation of the queue control signals is non-trivial and recursively defined as follows:

i=n_j-1:

The last reservation station does not have a successor. It is filled with the data of its predecessor iff its content is dispatched into the function unit (RS_{n_j-1}.doe=1) or if it is empty (RS_{n_j-1}.full=0):

RS_{n_j-1}.fill = RS_{n_j-1}.doe ∨ /RS_{n_j-1}.full

If the content of the predecessor (i.e., RS_{n_j-2}) is dispatched into the function unit, it must not be copied. Thus, the clear signal of the last reservation station is active in this case:

RS_{n_j-1}.clear = RS_{n_j-2}.doe

Table 3.7 contains a list of the possible values.

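| RS_{n_j-2}.full | RS_{n_j-1}.full | RS_{n_j-2}.doe | RS_{n_j-1}.doe | RS_{n_j-1}.clear | RS_{n_j-1}.fill | action in RS_{n_j-1} |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | copy previous RS, which is empty |
| 0 | 0 | 0 | 1 |   |   | not possible |
| 0 | 0 | 1 | 0 |   |   | not possible |
| 0 | 0 | 1 | 1 |   |   | not possible |
| 0 | 1 | 0 | 0 | 0 | 0 | no action |
| 0 | 1 | 0 | 1 | 0 | 1 | copy previous RS, which is empty |
| 0 | 1 | 1 | 0 |   |   | not possible |
| 0 | 1 | 1 | 1 |   |   | not possible |
| 1 | 0 | 0 | 0 | 0 | 1 | copy instruction in previous RS |
| 1 | 0 | 0 | 1 |   |   | not possible |
| 1 | 0 | 1 | 0 | 1 | 1 | clear RS, although already empty |
| 1 | 0 | 1 | 1 |   |   | not possible |
| 1 | 1 | 0 | 0 | 0 | 0 | no action |
| 1 | 1 | 0 | 1 | 0 | 1 | replace the current instruction with instruction in previous RS |
| 1 | 1 | 1 | 0 | 1 | 0 | no action |
| 1 | 1 | 1 | 1 |   |   | not possible |

Table 3.7: Deduction of the RS_{n_j-1}.clear and RS_{n_j-1}.fill signals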

i ∈ {1,...,n_j-2}:

For RS₁ to RS_{n_j-2}, the calculation of the RS_i.fill signal is slightly modified, since these reservation stations have a successor. The signal RS_i.fill is also active if RS_{i+1} takes over the content of RS_i:

RS_i.fill = RS_i.doe ∨ /RS_i.full ∨ RS_{i+1}.fill

The calculation of the clear signal is identical to the calculation in the previous case.

RS_i.clear = RS_{i-1}.doe

i=0:

The first reservation station does not have a predecessor. The input values for the first reservation station are provided by the decode/issue environment. These values are only valid if an instruction is issued into the first reservation station of function unit j. Thus, the reservation station is filled with an empty entry except on issue:

The calculation of the fill signal of reservation station zero is identical to the calculation in the general case.

In order to resolve the recurrence in the formulae of the RS_i.fill signals, define a set of signals F_j(i) as:

Now, a closed formula for RS_i.fill can be specified for i ∈ {0,...,n_j-1}:

RS_i.fill = ⋁_{k=i}^{n_j-1} F_j(k)

Since OR is associative, a parallel prefix circuit can be used in order to compute the RS_i.fill signals (figure 3.17).
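The queue control can be illustrated with a short C sketch. It assumes F_j(k) = RS_k.doe ∨ /RS_k.full, which is what the recursive definition above suggests, and it assumes that RS₀ is cleared whenever no instruction is issued. The function names, the array representation, and the bound `MAX_RS` are choices of this sketch; only the full bits of the queue are modeled.

```c
#include <stdbool.h>

#define MAX_RS 8 /* arbitrary upper bound for this sketch */

/* Computes fill[i] and clear[i] for RS_0..RS_{n-1}, 1 <= n <= MAX_RS.
 * full[i] and doe[i] are RS_i.full and RS_i.doe, issue is FU_j.issue. */
void queue_control(int n, const bool full[], const bool doe[], bool issue,
                   bool fill[], bool clear[])
{
    bool acc = false;
    for (int i = n - 1; i >= 0; i--) {
        bool f = doe[i] || !full[i]; /* assumed F_j(i) */
        acc = acc || f;              /* OR over all k >= i */
        fill[i] = acc;
        clear[i] = (i == 0) ? !issue : doe[i - 1];
    }
}

/* Applies one clock edge to the full bits of the queue. */
void queue_step(int n, bool full[], const bool doe[], bool issue)
{
    bool fill[MAX_RS], clear[MAX_RS];
    queue_control(n, full, doe, issue, fill, clear);
    /* from high to low index, so that full[i - 1] is still the old value */
    for (int i = n - 1; i >= 0; i--) {
        if (fill[i])
            full[i] = clear[i] ? false : (i == 0 ? true : full[i - 1]);
    }
}
```

The loop in `queue_control` evaluates the OR sequentially; in hardware the same values are produced by the parallel prefix circuit in logarithmic depth.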

Figure 3.17: Reservation station control

Correctness

The correctness of the calculation of the queue control signals RS_i.fill and RS_i.clear is an implication of the following three claims:

Claim 1: Issued instructions are stored in RS₀.

Proof of Claim 1: During issue, D.FU_j.issue is active (page ??), and therefore FU_j.full is inactive. This implies that there is either an empty reservation station or that there is a reservation station which is being dispatched. In either case, there is at least one reservation station RS_i with RS_i.fill=1. Thus, RS₀.fill is active, and RS₀.clear is inactive. The instruction is therefore stored in RS₀.

Claim 2: No instruction in a reservation station gets lost, i.e., it is either dispatched to the function unit or remains in a reservation station.

Claim 3: Reservation stations which are copied or dispatched are cleared or overwritten afterwards. Reservation stations which are dispatched are not copied.

Proof of claim 2 and 3: Let instruction I be in RS_i. If instruction I is dispatched (i.e., RS_i.doe is active), claim 2 is obvious. Claim 3 follows from RS_i.fill=1 and RS_i+1.clear=1.

Let instruction I not be dispatched (i.e., RS_i.doe=0). We will now show that I then either moves to the next reservation station RS_i+1 or stays in RS_i. For that purpose, we distinguish the cases that the signal RS_i+1.fill is active or inactive.

Let RS_i+1.fill=1. The RS_i+1.clear signal is inactive because of RS_i.doe=0, and thus the entry is copied into RS_i+1 and claim 2 follows. Claim 3 holds because of RS_i.fill=1, which is true because of RS_i+1.fill=1.

Let RS_i+1.fill=0. Claim 3 does not apply because the entry is neither dispatched nor copied. Claim 2 only applies for RS_i if it contains an instruction, i.e., RS_i.full=1. We distinguish two cases:

If i is n_j-1, i.e., if the reservation station is the last reservation station in the queue, RS_i.fill is calculated as:

Since RS_{n_j-1}.doe=0 and RS_{n_j-1}.full=1, the fill bit RS_{n_j-1}.fill is inactive, and the entry therefore remains in the queue.
If i is not n_j-1, RS_i.fill is calculated as:

3.6.5 Single Adjust

The single adjust circuits are only used in floating point function units. They are controlled by three signals, which are stored in the op data item of each reservation station. The op1.high, op2.high, and dest.high signals are the least significant bits of the register address of operand one, two, and the destination, respectively. The db signal is active iff the operation has double precision source registers.

Before entering the function unit, operands one and two from the reservation station pass through the single-adjust-one circuit. Double precision operands are passed unmodified (opx.high is false in this case, since double precision operands always have even register addresses). If single precision operands are used (the db signal is not active), the circuit ensures that the operand is always in the lower 32 bits of the data path. The upper 32 bits are set to zero, which is a requirement of the floating point function units used in this design. Table 3.8 lists the results of the circuit as a function of the input signals. An implementation of this function is given in figure 3.18.

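| Input: high | Input: db | Result: low part | Result: high part |
|-------------|-----------|------------------|-------------------|
| 0           | 0         | data[31:0]       | 0³²               |
| 1           | 0         | data[63:32]      | 0³²               |
| 0           | 1         | data[31:0]       | data[63:32]       |
| 1           | 1         | not possible     | not possible      |

Table 3.8: Single adjust before function unit for one operand. The input from the reservation station is data[63:0].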
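Table 3.8 translates directly into a few lines of C. The following sketch is only meant to illustrate the function of the circuit; the name `single_adjust_one` is an assumption, and the impossible input combination is rejected with an assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single adjust before the function unit for one operand (table 3.8).
 * data is the 64-bit operand delivered by the reservation station. */
uint64_t single_adjust_one(uint64_t data, bool high, bool db)
{
    assert(!(high && db)); /* double operands have even register addresses */
    if (db)
        return data;       /* double precision: passed unmodified */
    uint32_t single = high ? (uint32_t)(data >> 32) : (uint32_t)data;
    return (uint64_t)single; /* operand in the low part, high part zero */
}
```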

Figure 3.18: Single adjust before function unit

After the result leaves the function unit, the single-adjust-two circuit between the function unit and the producer part of the reservation station reverses this procedure. It ensures that a single precision result is both on the low and on the high part of the CDB to avoid any possible alignment problems. The implementation is given in figure 3.19.

Figure 3.19: Single adjust after function unit

3.6.6 The Producer

The producer (figure 3.20) propagates the results of an associated function unit on the CDB. Table 3.9 lists all components of the CDB. The producer has to generate a value for all components. The function unit provides a signal FUvalid. If FUvalid is set, the function unit delivers a result, result flags (ovf, IEEE flags, etc), and the tag. These values are stored in registers. In the same cycle, the producer requests the CDB for the next cycle at the CDB control by raising FU_j.CDBreq. Let t be the cycle of the request.

In case of an acknowledgement (FU_j.CDBack=1) by the CDB control in cycle t+1, the values in these registers are put on the CDB, and the register is filled with the next result. If the CDB control does not acknowledge the request within cycle t+1, the values stay in the registers, the function unit is stalled with the signal Pstall, and the producer requests the CDB again for cycle t+2.

Let FU_j.P be the register of the producer part of function unit j. The Pstall signal stalls the whole function unit. Alternatively, the function unit might contain a stall engine. The Pstall signal is active if there is an instruction in the producer register stage (FU_j.P.valid=1) and there is no acknowledge from the CDB control (FU_j.CDBack=0). The valid bit from the register is forced to zero in the power-up cycle (pup=1). Thus, Pstall is calculated as follows:

The CDB is requested iff the function unit provides a result (FUvalid=1) or there is a result in the register and no acknowledge from the CDB control (Pstall=1).

FU_j.CDBreq = FUvalid ∨ Pstall
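The two producer signals can be sketched in C as follows. The sketch assumes that Pstall is the valid bit of the producer register, forced to zero while pup is active, combined with the negated acknowledge; this is a reading of the description above. The names `producer_signals` and `producer_control` are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool p_stall; /* Pstall       */
    bool cdb_req; /* FU_j.CDBreq  */
} producer_signals;

producer_signals producer_control(bool p_valid, bool pup, bool cdb_ack,
                                  bool fu_valid)
{
    producer_signals s;
    bool valid = p_valid && !pup;      /* valid bit forced to zero at power-up */
    s.p_stall = valid && !cdb_ack;     /* result waiting, but no acknowledge */
    s.cdb_req = fu_valid || s.p_stall; /* FU_j.CDBreq = FUvalid OR Pstall */
    return s;
}
```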

Figure 3.20: Producer

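| Bus | Item  | Width | Purpose                                         |
|-----|-------|-------|-------------------------------------------------|
| CDB | tag   | J     | ROB tag of the instruction producing the result |
| CDB | valid | 1     | CDB.valid=1 ⇔ CDB contains valid data           |
| CDB | data  | 64    | actual result                                   |
| CDB | mal   | 1     | misaligned memory access                        |
| CDB | Dpf   | 1     | page fault during data memory access            |
| CDB | ovf   | 1     | overflow in ALU instruction                     |
| CDB | IEEEf | 5     | IEEE conforming floating point flags            |
| CDB | EData | 32    | exception data                                  |

Table 3.9: Components of the CDB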

3.7 Function Unit Environments

The function unit environments contain the function units, which are the ALU, floating point units, and the data memory interface (table 3.3). These environments belong to the execute stage.

3.7.1 Integer Function Unit

The integer function unit performs traditional ALU functions and shifting. Table 3.10 defines the interface to this function unit. It contains the coding of the operation control signals op[4:0]. The unit can generate one exception (FXU overflow). The exception can be suppressed by a bit in the opcode. This test is done in the ALU itself.

Figure A.4 (appendix A, page ??) gives the implementation of the integer function unit, which is taken almost literally from [MP95]. During issue, the op[] signals are calculated from corresponding bits in the instruction word as follows: The ALU function is defined by bits in the opcode. The position of these bits depends on the instruction format. Instructions with itype format (op.itype=1) use IR[30] and IR[28:26] for this task. Instructions with rtype format use IR[5:0]. The circuit in figure 3.21 selects the correct signals.

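| op[4] | op[3] | op[2] | op[1] | op[0] | Function                    |
|-------|-------|-------|-------|-------|-----------------------------|
| 0     | 0     | 0     | 0     | 0     | a << b                      |
| 0     | 0     | 0     | 1     | 0     | a >> b                      |
| 0     | 0     | 0     | 1     | 1     | a >> b (arithmetic)         |
| 1     | 0     | 0     | 0     | 0     | a+b with test of overflow   |
| 1     | 0     | 0     | 0     | 1     | a+b without test of overflow |
| 1     | 0     | 0     | 1     | 0     | a-b with test of overflow   |
| 1     | 0     | 0     | 1     | 1     | a-b without test of overflow |
| 1     | 0     | 1     | 0     | 0     | a ∧ b                       |
| 1     | 0     | 1     | 0     | 1     | a ∨ b                       |
| 1     | 0     | 1     | 1     | 0     | a ⊕ b                       |
| 1     | 0     | 1     | 1     | 1     | b[0:15] 0¹⁶                 |
| 1     | 1     | 0     | 0     | 1     | a > b ? 1 : 0               |
| 1     | 1     | 0     | 1     | 0     | a = b ? 1 : 0               |
| 1     | 1     | 0     | 1     | 1     | a ≥ b ? 1 : 0               |
| 1     | 1     | 1     | 0     | 0     | a < b ? 1 : 0               |
| 1     | 1     | 1     | 0     | 1     | a ≠ b ? 1 : 0               |
| 1     | 1     | 1     | 1     | 0     | a ≤ b ? 1 : 0               |

Table 3.10: Coding of integer operations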
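The coding of table 3.10 can be made concrete with a behavioral C sketch. Several details are assumptions of the sketch and are not stated in the table: the shift distance is taken from b[4:0], the comparisons are treated as signed two's complement comparisons, and unused codings return zero. The function name `alu` and the `ovf` output parameter are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioral sketch of the integer function unit (table 3.10).
 * *ovf is set iff an overflow occurs and the operation tests for it. */
uint32_t alu(unsigned op, uint32_t a, uint32_t b, bool *ovf)
{
    const uint32_t sign = 0x80000000u;
    uint32_t sh = b & 31u;                 /* assumed: distance is b[4:0] */
    uint32_t sa = a ^ sign, sb = b ^ sign; /* sign flipped: unsigned order
                                              equals signed order */
    uint32_t r;

    op &= 0x1Fu;
    *ovf = false;
    switch (op) {
    case 0x00: return a << sh;
    case 0x02: return a >> sh;
    case 0x03: /* arithmetic shift: replicate the sign bit */
        return (a >> sh) | ((a & sign) ? (uint32_t)~(0xFFFFFFFFu >> sh) : 0u);
    case 0x10: case 0x11:
        r = a + b;
        *ovf = (op == 0x10) && (((a ^ r) & (b ^ r) & sign) != 0);
        return r;
    case 0x12: case 0x13:
        r = a - b;
        *ovf = (op == 0x12) && (((a ^ b) & (a ^ r) & sign) != 0);
        return r;
    case 0x14: return a & b;
    case 0x15: return a | b;
    case 0x16: return a ^ b;
    case 0x17: return b << 16;  /* lower half of b followed by 16 zeros */
    case 0x19: return sa >  sb;
    case 0x1A: return a  == b;
    case 0x1B: return sa >= sb;
    case 0x1C: return sa <  sb;
    case 0x1D: return a  != b;
    case 0x1E: return sa <= sb;
    default:   return 0;        /* unused codings: arbitrary choice */
    }
}
```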

Figure 3.21: Calculation of op[4:0] for the ALU

3.7.2 Floating Point Function Units

With respect to cost and delay, the floating point units are taken from [Lei98]. Nevertheless, the design supports any function units which comply with the interface. Since each function unit can generate an independent stall signal, even function units with variable latency can be used.

In contrast to [Lei98], each function unit is assumed to have a rounder of its own. However, this is only relevant for cost, delay, and CPI calculation. The scheduling algorithm itself does support sharing of floating point unit parts between function units, which results in big cost savings. However, since no CPI simulations are available for floating point units with a shared rounder, separate rounders are used to keep the machine comparable. For the same reason, the list of floating point function units (table 3.11) is taken from [Ger98]. The table also lists the number of reservation stations which belong to each function unit. Floating point reservation stations are very expensive regarding hardware cost. Thus, it is advisable to combine the multiplication/division FU and the conversion/test FU to save two sets of reservation stations. Again, simulations for this configuration are missing.

One third of the cost of a floating point reservation station is caused by the operand entries for the rounding mode and the interrupt mask. In order to save this cost, it is possible to encode the rounding mode RM in the instruction opcode (there is still room left, appendix C). Furthermore, forwarding of the interrupt mask is not cost efficient, since it changes rarely. It is only required in the rounder, which is in the last stages of each floating point function unit. Because of that, it is more cost efficient to design a floating point function unit which directly reads the interrupt mask from the register file as soon as an instruction arrives at the rounder stage. If the interrupt mask is not valid, the function unit could generate a stall signal.

| Purpose                                 | Latency | # RS |
|-----------------------------------------|---------|------|
| floating point addition and subtraction | 5       | 2    |
| floating point multiplication           | 5       | 2    |
| floating point division                 | 15      | 1    |
| conversion floating point / integer     | 4       | 1    |
| floating point condition tests          | 1       | 1    |

Table 3.11: Floating point function units

The implementation of floating point units is beyond the scope of this thesis. The actual operation performed by the FU is determined by IR[8:0]. These bits are forwarded to the reservation station during issue as part of the op bits.

3.8 CDB Control Environment

The CDB control environment allocates the CDB to the function units. The CDB is requested by the producer of the function unit i by raising FU_i.CDBreq. The CDB control environment generates exactly one FU_j.CDBack in the next cycle. Figure 3.22 gives an implementation.

3.8.1 Deduction

Let n be the total number of producers and let FU_i(t).CDBreq and FU_i(t).CDBack be the request and acknowledge signals of function unit i ∈ {0,...,n-1} in cycle t. Now, R(t) and A(t) are defined as follows: R(t) contains the producers which request the CDB in cycle t. A(t) contains the active acknowledge signals in cycle t.

R(t)	=	{ i ∈ {0,...,n-1} | FU_i(t).CDBreq = 1 }
A(t)	=	{ i ∈ {0,...,n-1} | FU_i(t).CDBack = 1 }

In each cycle t, multiple producers might be requesting the CDB. The CDB control has to choose exactly one because only one unit can use the CDB. The correctness proof of the Tomasulo scheduling algorithm with reorder buffer requires a guarantee that any unit requesting the CDB will get an acknowledgement within a finite number of cycles (chapter 6). This is achieved by allocating the CDB round robin. This leads to the following algorithm for the calculation of the acknowledge signals:

Since only one unit can get the CDB, a(t) can be defined as:

Thus, it is required that there is exactly one function unit which gets the CDB for each cycle. The producer has to put defined values (with CDB.valid=0) on the CDB if it does not have real data. If there is only one request for the CDB for a given cycle, the calculation of a(t+1) is obvious. In case of more requests, round robin scheduling requires that the CDB is assigned to the unit which comes next after the unit which got the CDB in the previous cycle. In case of the last unit, the next unit is the first one.
M(t) contains the producers which have higher indices than a(t). It is defined as:

M(t) = { i ∈ {0,...,n-1} | i > a(t) }

The requests from these producers are in R_high(t):

R_high(t) = R(t) ∩ M(t)

Now, a(t) can be defined inductively. In the first (power-up) cycle, a is forced to be zero, i.e., the first function unit gets the CDB. The value a(t+1) is determined as follows: If there are no requests, a remains the same. If there are one or more requests, the requests of the units in R_high(t) are checked first, and the one with the lowest index among them is acknowledged. If there are no such requests, the request with the lowest index in R(t) is acknowledged.
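The inductive definition can be written down as a small C function. The sketch works on indices rather than on one-hot signals; the name `next_owner` and the packing of the request signals into a 32-bit word are assumptions of the sketch.

```c
#include <stdint.h>

/* Returns a(t+1). a is the current owner a(t), bit i of req is
 * FU_i(t).CDBreq, and n <= 32 is the number of producers. */
unsigned next_owner(unsigned a, uint32_t req, unsigned n)
{
    for (unsigned i = a + 1; i < n; i++) /* min(R_high(t)), if not empty */
        if (req & (1u << i))
            return i;
    for (unsigned i = 0; i < n; i++)     /* otherwise min(R(t)) */
        if (req & (1u << i))
            return i;
    return a;                            /* no request: a stays the same */
}
```

A producer that keeps requesting is passed over by at most n-1 other producers before it is served, which is the bounded waiting time the correctness proof needs.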

Figure 3.22: Common data bus control

3.8.2 Implementation

The R(t) set, which contains the requests for the CDB, is provided by the producers of the function units as FU_j(t).CDBreq.
The A(t) set is taken from a register, i.e., this register contains the FU_j(t).CDBack signals.
M(t) is the set of producers with higher indices than a(t):

Let M_i(t) denote that i lies in M(t), thus:

M_i(t) = 1 ⇔ i ∈ M(t)

Now, M_i(t) is calculated as follows:

This calculation is done by a parallel prefix circuit, since OR is associative. The outputs of the parallel prefix circuit are inverted to get the final signals of M.
The implementation of R(t) ∩ M(t) is obvious:

j ∈ (R(t) ∩ M(t)) ⇔ (j ∈ R(t)) ∧ (j ∈ M(t))

(R(t) ∩ M(t))_j = R_j(t) ∧ M_j(t)
The two minimum operations for the calculation of a(t+1) are combined in one 2n-bit find first one circuit to save cost and delay. The lower n input signals of the circuit are connected to the R_high(t) bits calculated in the previous step. The input signals n to 2n-1 are connected to R(t).

If R(t) is empty, the zero output of the find first one circuit is active. This disables the clock enable signal of the register which holds the acknowledge signals.

If R_high(t) is not empty, the find first one circuit returns min( R_high(t)) in output bits 0 to n-1. The output bits n to 2n-1 are zero in this case.

If R_high(t) is empty, the find first one circuit returns zero in output bits 0 to n-1 and min( R(t)) in output bits n to 2n-1. The bits n to 2n-1 are mapped onto bits 0 to n-1 with n OR gates. This bit set is A(t+1) and is stored in the register.
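The bit-level implementation can be modeled in C as follows. The sketch assumes that M_i(t) is the inverted OR of the acknowledge bits A_i(t),...,A_{n-1}(t), which is one way to read the description of the parallel prefix circuit. The names `find_first_one` and `cdb_arbiter` and the packing of the signals into machine words are choices of this sketch.

```c
#include <stdint.h>

/* Find first one: one-hot code of the lowest set bit, 0 if v is zero. */
static uint64_t find_first_one(uint64_t v)
{
    return v & (~v + 1u);
}

/* ack is the one-hot register A(t), req is R(t), n <= 32 is the number
 * of producers. Returns A(t+1). */
uint32_t cdb_arbiter(uint32_t ack, uint32_t req, unsigned n)
{
    uint32_t all = (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint32_t seen = 0, m = 0;
    for (unsigned i = n; i-- > 0; ) { /* sequential stand-in for the prefix circuit */
        seen |= (ack >> i) & 1u;      /* OR of A_i..A_{n-1} */
        if (!seen)
            m |= 1u << i;             /* M_i = 1 iff i > a(t) */
    }
    uint32_t r_high = req & m & all;
    /* 2n-bit input: R_high(t) in bits 0..n-1, R(t) in bits n..2n-1 */
    uint64_t in = ((uint64_t)(req & all) << n) | r_high;
    uint64_t out = find_first_one(in);
    if (out == 0)
        return ack;                   /* zero output: register keeps its value */
    return (uint32_t)((out | (out >> n)) & all); /* n OR gates */
}
```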

3.9 Reorder Buffer Environment

The reorder buffer realizes in-order termination which is essential for precise interrupts. An introduction on reorder buffers is given in chapter 2.

The size of the reorder buffer has a significant impact on the CPI rate [Ger98]. For this thesis, a reorder buffer with Q=16 entries is assumed, which is a cost efficient size, as shown by simulations. The size is assumed to be a power of two. The buffer requires J=log₂ Q = 4 address bits (i.e., tag bits).

The reorder buffer itself is realized with two RAMs: ROB1 and ROB2. Table 3.12 lists all components of the ROB, their purpose and size, and the RAM they belong to. This separation saves cost and delay since the values in ROB2 are only used during retire and for the destination operand during issue. For these values, the forwarding read ports are saved. Table 3.13 shows the use of the ports.

ROB1 is a nine port Q × 105 RAM (figure 3.23). ROB2 is a two port Q × 78 RAM. The reorder buffer is organized as a circular FIFO queue. It is addressed by two pointers: ROB.tail is the tail pointer and points to the target entry for new instructions. ROB.head is the head pointer and points to the next instruction for retire.

The head and tail pointers are maintained by two circuits which are identical to the circuits in [Lei98]. They provide ROB.head and ROB.tail and are controlled by two clock enable signals, ROB.headce and ROB.tailce, respectively. If the clock enable signal of a circuit is active, the corresponding pointer is incremented by one (with wrap-around). In case of an interrupt, both pointers are set to zero. An implementation of both circuits is shown in figure A.5 (appendix A, page ??).

Another auxiliary circuit, which is also taken from [Lei98], calculates the ROB.full signal. If set, the signal indicates that the ROB is full. Furthermore, the circuit provides ROB.empty, which is active iff the ROB is empty. The circuit is controlled by the same control signals used for ROB.head and ROB.tail (figure A.6, appendix A, page ??).
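The behavior of the pointers and of the full and empty signals can be illustrated in C. The circuits themselves are those of [Lei98] and are not reproduced here; the sketch below uses an entry counter as one possible way to derive ROB.full and ROB.empty, which may differ from the actual circuit. All names are hypothetical.

```c
#include <stdbool.h>

enum { ROB_Q = 16 }; /* Q = 16 entries, J = 4 tag bits */

typedef struct {
    unsigned head;  /* ROB.head */
    unsigned tail;  /* ROB.tail */
    unsigned count; /* number of allocated entries (sketch only) */
} rob_ptrs;

/* One clock edge. issue and retire stand for the events which increment
 * the tail and the head pointer, respectively. */
void rob_clock(rob_ptrs *p, bool jisr, bool issue, bool retire)
{
    if (jisr) { /* interrupt: both pointers are set to zero */
        p->head = p->tail = p->count = 0;
        return;
    }
    if (issue)  { p->tail = (p->tail + 1) % ROB_Q; p->count++; }
    if (retire) { p->head = (p->head + 1) % ROB_Q; p->count--; }
}

bool rob_full(const rob_ptrs *p)  { return p->count == ROB_Q; }
bool rob_empty(const rob_ptrs *p) { return p->count == 0; }
```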

3.9.1 Issue

During issue, the ROB entry pointed to by ROB.tail is allocated and initialized for the new instruction. This is done via port eight of the ROB. The write enable signal of this port is active iff issuestall is inactive, i.e., when an issue is performed. During issue, the tail pointer is incremented. The tail pointer is cleared during JISR. Thus, the ROB.tailce signal is calculated as:

The valid data item is initialized with one, iff the instruction is not passed to a function unit. This is indicated by the noFU signal, which is generated by the control automaton. The data and IEEEf items are filled with dummy data to have defined values in the ROB.

The dmal, Dpf, ovf, and IEEEf data items are used for interrupt processing. They indicate exceptions which can occur during the execution phase of an instruction and they are initialized with zero. In case of a trap instruction, EData contains the immediate constant co1 from the instruction register environment, which allows passing of an argument to the interrupt service routine.

The ill, imal, Ipf, trap, and uFOP items of the ROB2 RAM are used for exceptions which occur during fetch or decode/issue. They are initialized with the corresponding signals provided by the decode/issue environment.

The ROB2 RAM contains data items dest, db, gpr, fpr, and spr. These items specify the register file, the register file address and the operand width (double or single) for writeback. They are initialized by the corresponding values generated by the decode/issue environment (section 3.5).

For the calculation of EPC and EPCn after the interrupt, the following additional information is required: the PC of the instruction and the branch/jump target from the PC environment. The bj data item is active iff the instruction in the ROB entry is a branch or jump instruction.

Figure 3.23: Reorder buffer

| Name   | Width | ROB  | Purpose                                         |
|--------|-------|------|-------------------------------------------------|
| valid  | 1     | ROB1 | valid=1 ⇔ data contains a valid value           |
| data   | 64    | ROB1 | result data                                     |
| dmal   | 1     | ROB1 | misaligned data memory access                   |
| Dpf    | 1     | ROB1 | data memory page fault                          |
| ovf    | 1     | ROB1 | overflow in ALU instruction                     |
| IEEEf  | 5     | ROB1 | IEEE flags (only used by floating point instr.) |
| EData  | 32    | ROB1 | exception data                                  |
| Σ      | 105   | ROB1 |                                                 |
| ill    | 1     | ROB2 | illegal instruction                             |
| imal   | 1     | ROB2 | misaligned instruction memory access            |
| Ipf    | 1     | ROB2 | instruction memory page fault                   |
| trap   | 1     | ROB2 | trap=1 ⇔ instruction is a trap instruction      |
| uFOP   | 1     | ROB2 | unimplemented floating point instruction        |
| dest   | 4     | ROB2 | destination register address                    |
| db     | 1     | ROB2 | db=1 ⇔ result has double precision              |
| fpr    | 1     | ROB2 | fpr=1 ⇔ dest is a floating point register       |
| spr    | 1     | ROB2 | spr=1 ⇔ dest is a special purpose register      |
| gpr    | 1     | ROB2 | gpr=1 ⇔ dest is a general purpose register      |
| PC     | 32    | ROB2 | PC of the instruction                           |
| target | 32    | ROB2 | target / fallthrough address                    |
| bj     | 1     | ROB2 | bj=1 ⇔ instruction is a branch/jump             |
| Σ      | 78    | ROB2 |                                                 |

Table 3.12: Components of a reorder buffer entry

| Port | Use        | RAM        | Purpose                              |
|------|------------|------------|--------------------------------------|
| 1    | read only  | ROB1       | Forwarding of low part of operand 1  |
| 2    | read only  | ROB1       | Forwarding of high part of operand 1 |
| 3    | read only  | ROB1       | Forwarding of low part of operand 2  |
| 4    | read only  | ROB1       | Forwarding of high part of operand 2 |
| 5    | read only  | ROB1       | Forwarding of operand 3              |
| 6    | read only  | ROB1       | Forwarding of operand 4              |
| 7    | read only  | ROB1, ROB2 | Retire                               |
| 8    | write only | ROB1, ROB2 | Issue (destination)                  |
| 9    | write only | ROB1       | Completion                           |

Table 3.13: Use of the reorder buffer ports

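| Interrupt             | Symbol | Priority | Resume         | Maskable | External |
|-----------------------|--------|----------|----------------|----------|----------|
| reset                 | reset  | 0        | abort          | no       | yes      |
| illegal instruction   | ill    | 1        | abort          | no       | no       |
| misaligned access     | mal    | 2        | abort          | no       | no       |
| page fault IM         | Ipf    | 3        | repeat         | no       | no       |
| page fault DM         | Dpf    | 4        | repeat         | no       | no       |
| trap                  | trap   | 5        | continue       | no       | no       |
| FXU overflow          | ovf    | 6        | continue       | yes      | no       |
| FPU overflow          | fOVF   | 7        | abort/continue | yes      | no       |
| FPU underflow         | fUNF   | 8        | abort/continue | yes      | no       |
| FPU inexact result    | fINX   | 9        | abort/continue | yes      | no       |
| FPU divide by zero    | fDBZ   | 10       | abort/continue | yes      | no       |
| FPU invalid operation | fINV   | 11       | abort/continue | yes      | no       |
| FPU unimplemented     | uFOP   | 12       | continue       | no       | no       |
| external I/O          | ex_j   | 12+j     | continue       | yes      | yes      |

Table 3.14: Interrupts and coding of SR/CA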

3.9.2 Retire

On retire, a result is fetched from the head of the reorder buffer and written into the register file. This is done with ROB port seven. The conditions for retire are that the ROB must not be empty and that the entry at the head is valid.

retire = /ROB.empty ∧ ROB.p7.Dout.valid

During retire, the ROB head pointer is incremented. The head pointer is cleared during JISR. Thus, the ROB.headce signal is calculated as:

ROB.headce = JISR ∨ ROB.retire

Before the actual writeback, interrupts are checked (almost identical to [MP95]; in contrast to [MP95], MCA[6] must be masked). The first step is to collect the occurred interrupts in CA[i]. CA[i] is active iff an interrupt of priority i occurred. Table 3.14 lists all interrupts and their priority. Lower numbers denote higher priority. CA[0] is the reset interrupt. It is triggered by pup, the power-up signal, in any case, even if there is no instruction to interrupt. CA[1] to CA[12] are internal interrupts. Their event signals are stored in the ROB and are therefore only valid during retire.

The misaligned access (mal) interrupt indicates both instruction memory and data memory misaligned accesses, which have separate event signals in the ROB. The calculation of CA[2] therefore differs from that of the other internal interrupts. CA[13] to CA[31] are left over for external interrupts with event signals ex₁ to ex₁₉.

Most interrupts are maskable (table 3.14). The service of interrupt i can be suppressed by setting SR[i] to zero. MCA contains the occurred interrupts which are not masked.

If an interrupt is serviced (i.e., at least one MCA[i] signal is active), the JISR signal is activated.

If the interrupt is of type continue, which is indicated by the IRQcontinue signal, the writeback must take place in spite of the interrupt. Interrupts 5 to 31 are of this type. The writeback is controlled by the writeback signal.
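The interrupt check can be sketched in C as follows. The sketch rests on several assumptions which are not confirmed by the formulae of the design: the set of maskable interrupts is read off table 3.14, IRQcontinue is derived from the highest priority interrupt in MCA using the statement that interrupts 5 to 31 are of type continue, and writeback is assumed to be retire combined with the absence of an aborting or repeating interrupt. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 6..11 and 13..31: interrupts marked as maskable in table 3.14. */
#define MASKABLE 0xFFFFEFC0u

typedef struct {
    uint32_t mca;          /* occurred interrupts which are not masked */
    bool     jisr;         /* JISR */
    bool     irq_continue; /* IRQcontinue */
    bool     writeback;    /* writeback */
} irq_signals;

/* Bit i of ca is CA[i], bit i of sr is SR[i]. */
irq_signals check_interrupts(uint32_t ca, uint32_t sr, bool retire)
{
    irq_signals s;
    s.mca  = ca & (sr | (uint32_t)~MASKABLE); /* non-maskable bits always pass */
    s.jisr = s.mca != 0;
    uint32_t first = s.mca & (~s.mca + 1u);   /* lowest index = highest priority */
    s.irq_continue = s.jisr && first >= (1u << 5);
    s.writeback = retire && (!s.jisr || s.irq_continue);
    return s;
}
```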

The writeback of the result of the instruction into the register file is performed via register file / producer table port three with the wb (writeback) signals. The address (wb.A), the register file selection, and the result data are taken from the ROB. These values are also used to address the producer table.

wb.A	=	wb.ROB2.Dout.dest
wb.gpr	=	wb.ROB2.Dout.gpr
wb.fpr	=	wb.ROB2.Dout.fpr
wb.spr	=	wb.ROB2.Dout.spr
wb.db	=	wb.ROB2.Dout.db
wb.l.RF.Din	=	wb.ROB1.Dout.data[31:0]
wb.h.RF.Din	=	wb.ROB1.Dout.data[63:32]

The interface to the register file specifies two write enable signals: The first one, wb.wl, is used for the GPR and SPR register files and for the low part of the floating point registers. The second one, wb.wh, is only used for the high part of the floating point registers. A single precision floating point register is in the low part iff its address is even (i.e., wb.A[0] is zero). For double precision values, both write enable signals have to be active.
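The two write enable signals can be sketched in C. The formulae in the sketch are a reading of the description above, not the formulae of the design; in particular, gating both signals with the writeback signal is an assumption. The names `wb_enables` and `wb_write_enables` are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool wl; /* wb.wl: GPR, SPR, and low part of the FPR */
    bool wh; /* wb.wh: high part of the FPR */
} wb_enables;

wb_enables wb_write_enables(bool writeback, bool gpr, bool spr, bool fpr,
                            bool db, unsigned addr)
{
    bool odd = (addr & 1u) != 0; /* wb.A[0] */
    wb_enables e;
    /* single precision FPR results go to the low part iff the address is even */
    e.wl = writeback && (gpr || spr || (fpr && (db || !odd)));
    e.wh = writeback && fpr && (db || odd);
    return e;
}
```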

For the producer table, port three is also used as read port to compare the tag with the address of the reorder buffer entry. If they are equal, the valid bit of the register is set.

wb.l.Prod.w	=	wb.l.RF.w ∧ (wb.l.Prod.Dout.tag=ROB.head)
wb.h.Prod.w	=	wb.h.RF.w ∧ (wb.h.Prod.Dout.tag=ROB.head)

wb.l.Prod.Din.valid	=	1
wb.l.Prod.Din.tag	=	0^J
wb.h.Prod.Din.valid	=	1
wb.h.Prod.Din.tag	=	0^J

Furthermore, during writeback, the IEEEf special purpose register is updated with the value of the IEEEf data item of the ROB entry. This is done by the register file environment (section 3.10).

3.9.3 Completion

During completion, the producer parts of the reservation stations put a result on the CDB. This result has to be written into the ROB. This is done with port nine of the ROB1 RAM. The ROB2 RAM does not contain any values which are to be modified during completion.

The valid flag of the CDB is used as the write enable signal of the write port. The write address to the ROB RAM is the tag of the instruction on the CDB. The CDB is used as data input to the RAM, since all data items of the ROB1 RAM have corresponding items on the CDB.

3.10 Register File Environment

3.10.1 Register Values

The register file environment contains the different register files, which are the general purpose register file (GPR), the floating point register file (FPR), and the special purpose register file (SPR). All register files have three ports. Ports one and two are read only; they are used by the decode/issue environment for the source operands. Port three is a write only port and is used for writeback by the reorder buffer. The Tomasulo scheduling algorithm prevents concurrent read/write accesses of register values on the same address.

The GPR (figure 3.24) consists of 32 × 32 integer registers (R₀,...,R₃₁). It is implemented as three port 32 × 32 standard RAM. R₀ is defined to be always zero, which is realized by testing the register address and by pulling the output down in case of R₀.

The FPR (figure 3.25) consists of 32 × 32 single precision floating point registers (FGR₀,...,FGR₃₁). These registers can also be accessed as 16 × 64 double precision floating point registers (FPR₀, FPR₂,...,FPR₃₀). The FPR is split into two parts, one for the lower 32 bits and one for the higher 32 bits. It is implemented as two three port 16 × 32 standard RAMs.

The SPR consists of several registers needed for special purposes such as flags and masks. The SPR registers are listed in table 3.15. The SPR is designed in analogy to the SPR in [MP95], i.e., real registers are used instead of a RAM. The interface to the decode/issue environment is identical to the interface of the GPR. Thus, the SPR environment has three address decoders (two for read, one for write).

Of these three decoders, at most two are used simultaneously, because the DLX instruction set has no instruction with two explicit SPR source registers. The decoders and the output busses of the read ports are shown in figure 3.26. Based on the values generated by the decoders, signals P1[31:0] to P3[31:0] for the ports one to three are calculated. These signals are used as output enable signals for the drivers of the read ports and as clock enable signals for the write port.

3.10.2 Special Circuits for the SPR

For the SPR register file, several registers have extended access modes. The IEEE standard requires a status flags register for floating point instructions [Ins85, EP97]. It contains a bit for each IEEE exception. Whenever a floating point exception occurs, the flag bit in the IEEEf register is set. Since the IEEE standard requires the IEEE flags to be sticky, the new value is ORed with the old value and re-written into the register (figure 3.27). This only applies to new values from the ROB IEEEf data item, which is only used by floating point instructions. For integer instructions, the IEEEf data item of the ROB is zero. Values written into IEEEf with movi2s are written without modification.
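
The sticky update of IEEEf can be summarized in a few lines. The function name `next_ieeef` and the flag `from_movi2s` are assumptions for illustration; the five-bit width follows the IEEEf data item of the ROB.

```python
IEEEF_MASK = 0x1F  # five IEEE exception flags


def next_ieeef(old: int, new: int, from_movi2s: bool) -> int:
    """Next value of the IEEEf register (behavioral sketch)."""
    if from_movi2s:
        # Explicit writes with movi2s replace the register content.
        return new & IEEEF_MASK
    # Values from the ROB IEEEf data item are ORed in, so flags are sticky.
    # Integer instructions deliver zero and therefore change nothing.
    return (old | new) & IEEEF_MASK
```

For example, retiring a floating point instruction with flags 0b00100 while IEEEf holds 0b00001 yields 0b00101, whereas a movi2s with operand 0 clears all flags.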

During rfe, only one SPR register has to be modified: The content of the ESR register is copied into the SR register. This is controlled by the DOrfe signal (chapter 3.5.6).

Both JISR and rfe are realized by direct access to the single registers. Figure 3.27 gives the implementation for the SR, ESR, and IEEEf special purpose registers. The implementation of EPC, EPCn, ECA, and EDATA is identical to the implementation of ESR. RM and FCC do not require any special circuits.

The registers ECA, SR and EDATA require special circuits. During JISR, the following SPR actions have to be performed:

ESR	=	SR
ECA	=	MCA
SR	=	0
EDATA	=	ROB[ROB.head].Edata

Furthermore, EPC and EPCn are updated by a calculation based on values of the ROB. This calculation is identical to the one in [Lei98]. It ensures that the EPC/EPCn registers hold the PCs of the next two instructions. The calculation distinguishes several cases (branch taken or not taken, instruction in a delay slot or not). An implementation is given in figure A.7 (appendix A).
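
The register transfers for JISR and rfe can be sketched as follows. The dictionary-based SPR model and the function names `jisr` and `rfe` are illustrative assumptions, and the EPC/EPCn calculation is deliberately left out. All transfers of one JISR happen in the same cycle, so ESR receives the old value of SR.

```python
def jisr(spr: dict, mca: int, rob_head_edata: int) -> None:
    """SPR actions during JISR (EPC/EPCn update omitted)."""
    old_sr = spr["SR"]            # sample SR before it is cleared
    spr["ESR"] = old_sr           # ESR   = SR
    spr["ECA"] = mca              # ECA   = MCA
    spr["SR"] = 0                 # SR    = 0
    spr["EDATA"] = rob_head_edata # EDATA = ROB[ROB.head].Edata


def rfe(spr: dict) -> None:
    """SPR action during rfe: restore the status register."""
    spr["SR"] = spr["ESR"]


spr = {"SR": 0b1011, "ESR": 0, "ECA": 0, "EDATA": 0}
jisr(spr, mca=0b0100, rob_head_edata=0x1234)
after_jisr = dict(spr)
rfe(spr)
```

After the `rfe` call the original interrupt mask is back in SR, which is the intended round trip.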

Figure 3.24: The general purpose registers

Figure 3.25: The floating point registers

Figure 3.26: The decoders of the special purpose register file

Nr.	Name	Purpose
0	SR	Status register (interrupt mask)
1	ESR	Exception status register
2	EPC	Exception program counter
3	EPCn	Exception program counter 2
4	ECA	Exception cause register
5	EData	Exception data register
6	RM	Floating point rounding mode
7	IEEEf	IEEE interrupt flags
8	FCC	Floating point comparison flag

Table 3.15: Special purpose registers

Figure 3.27: Three special purpose registers

3.10.3 Producer Tables

The register file environment also contains the producer tables (storage for valid bits and tags). There is one table for each of the three register files. All producer tables have four ports. The ports one and two are read only; they are used by the decode/issue environment for the two source operands. Port three is used for writeback by the reorder buffer; it is a write only port. Port four is used by decode/issue in order to set the flags for the destination register. In contrast to the register files, all producer tables are made of register based RAM to allow concurrent read/write access to the same address. Furthermore, all producer tables have an input signal init, which sets all valid bits in one cycle if active.

The GPR producer table (figure 3.28) is implemented as four port 32 × (J+1) register based RAM. R₀ is defined to be always zero, thus it is required to keep R₀.valid always true to prevent result forwarding from instructions writing into R₀.
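
A behavioral sketch of the GPR producer table follows. It shows only the init signal, the read ports, and the issue port. Ignoring issue writes to address zero is one possible way to keep R₀.valid true and is an assumption here, as are the class and method names.

```python
class GprProducerTable:
    """Sketch of the 32 x (J+1) producer table: one valid bit plus a J-bit tag."""

    def __init__(self, entries: int = 32) -> None:
        self.valid = [True] * entries
        self.tag = [0] * entries

    def init(self) -> None:
        # init signal: set all valid bits in one cycle.
        self.valid = [True] * len(self.valid)

    def read(self, addr: int):
        # Ports one and two: valid bit and tag of a source register.
        return self.valid[addr], self.tag[addr]

    def issue(self, dest: int, rob_tail: int) -> None:
        # Port four: clear valid and store the tag of the new producer.
        if dest == 0:
            return                # assumption: R0.valid must stay true
        self.valid[dest] = False
        self.tag[dest] = rob_tail


table = GprProducerTable()
table.issue(dest=7, rob_tail=3)
table.issue(dest=0, rob_tail=4)
```

A later instruction reading R₇ sees `valid = False` and tag 3, so it waits for the result of ROB entry 3, while readers of R₀ never wait.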

The FPR producer table (figure 3.29) is split into two parts, just as the FPR register file. It is implemented as two four port 16 × (J+1) register based RAMs.

The SPR producer table (figure 3.30) is similar to the GPR producer table with two exceptions. It does not contain the test for address zero and it has two additional multiplexers in order to realize accesses to the RM and MASK registers during issue of floating point instructions. This saves two extra read ports for operand bus three and four. In case of a floating point instruction, the ports one and two are used for operand bus three and four. Since floating point instructions never read any other special purpose registers, no conflict arises.

Updating of the Producer Tables during Issue

During issue, the valid bit of the destination register has to be cleared and the tag of the instruction has to be stored in the producer table. This is done with RAM port four.

The dest.l.w and dest.h.w signals are the write enable signals of port four of the low and the high memory bank, respectively. The GPR and the SPR only have the low bank. The FPR has a high bank, which is in use while accessing double precision registers or while accessing single precision registers with odd addresses (dest.A[0]=1). The low bank is in use while accessing double precision registers or while accessing single precision registers with even addresses (dest.A[0]=0). Thus:

The valid bit written is always zero. The tag bits written are the ROB tail pointer bits provided by the reorder buffer environment.

Dest.l.Prod.Din.valid	=	0
Dest.l.Prod.Din.tag	=	ROB.tail
Dest.h.Prod.Din.valid	=	0
Dest.h.Prod.Din.tag	=	ROB.tail
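
The bank selection described above can be sketched as follows. The sketch assumes that an instruction with a destination register is being issued and that an FPR row is addressed by the upper four bits of the register address. The function names and the dictionary layout of a table entry are hypothetical.

```python
def bank_write_enables(dest_is_fpr: bool, db: bool, dest_addr: int):
    """Return (dest.l.w, dest.h.w) for port four during issue."""
    if not dest_is_fpr:
        return True, False         # GPR and SPR only have the low bank
    odd = bool(dest_addr & 1)      # dest.A[0]
    low_w = db or not odd          # double precision or even single precision
    high_w = db or odd             # double precision or odd single precision
    return low_w, high_w


def issue_update(low_bank, high_bank, dest_is_fpr, db, dest_addr, rob_tail):
    """Write valid = 0 and tag = ROB.tail into the enabled banks."""
    low_w, high_w = bank_write_enables(dest_is_fpr, db, dest_addr)
    row = dest_addr >> 1 if dest_is_fpr else dest_addr
    if low_w:
        low_bank[row] = {"valid": 0, "tag": rob_tail}
    if high_w:
        high_bank[row] = {"valid": 0, "tag": rob_tail}


low = [{"valid": 1, "tag": 0} for _ in range(16)]
high = [{"valid": 1, "tag": 0} for _ in range(16)]
issue_update(low, high, dest_is_fpr=True, db=False, dest_addr=5, rob_tail=9)
issue_update(low, high, dest_is_fpr=True, db=True, dest_addr=8, rob_tail=10)
```

The first call touches only row 2 of the high bank (FGR₅); the second marks row 4 in both banks (FPR₈).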

Figure 3.28: The general purpose registers producer table

Figure 3.29: The floating point registers producer table

Figure 3.30: The special purpose registers producer table

	Target	Active control	Monomials
	state	signals	IR[31:26]	IR[6]	IR[5:0]
	ALU	rtype, FU.alu	000000	*	0001**
			000000	*	10****
	Shifti	rtype, FU.alu	000000	*	0000**
	ALUi	itype, FU.alu	01**	*	******
	Load	itype, load, FU.mem	100***	*	******
	Load.s	itype, load, fp, FU.mem	110001	*	******
	Load.d	itype, load, fp, db, FU.mem	110101	*	******
	Store	itype, store, FU.mem	101***	*	******
	Store.s	itype, store, fp, FU.mem	111001	*	******
	Store.d	itype, store, fp, db, FU.mem	111101	*	******
	Faddsub.s	rtype, faddsub, fp, FU.fadd	010001	0	00000*
	Faddsub.d	rtype, faddsub, fp, db, FU.fadd	010001	1	00000*
	Fmul.s	rtype, fmul, fp, FU.fmul	010001	0	000010
	Fmul.d	rtype, fmul, fp, db, FU.fmul	010001	1	000010
	Fdiv.s	rtype, fdiv, fp, FU.fdiv	010001	0	000011
	Fdiv.d	rtype, fdiv, fp, db, FU.fdiv	010001	1	000011
	Fcond.s	rtype, fcc, fp, FU.ftest	010001	0	11****
ID1	Fcond.d	rtype, fcc, fp, FU.ftest	010001	1	11****
	Fabsneg.s	rtype, fabsneg, fp, FU.fconv	010001	0	00010*
	Fabsneg.d	rtype, fabsneg, fp, db, FU.fconv	010001	1	00010*
	Ff2i	rtype, ff2i, fp, FU.fconv	010001	*	001001
	Fi2f	rtype, fi2f, fp, FU.fconv	010001	*	001010
	FMov.s	rtype, fmov, fp, FU.fconv	010001	0	001000
	FMov.d	rtype, fmov, fp, db, FU.fconv	010001	1	001000
	FConv.s	rtype, fconv, fp, FU.fconv	010001	*	100*00
	FConv.d	rtype, fconv, fp, db, FU.fconv	010001	*	100001
	Branch	itype, bjjr, branch, noFU	00010*	*	******
	FBranch	itype, bjjr, branch, fp, noFU	00011*	*	******
	JumpReg	itype, bjjr, bjtaken, jumpR, noFU	010110	*	******
	Jump&LinkReg	itype, jalr, bjtaken, jumpR, noFU	010111	*	******
	Jump	jtype, bjjr, bjtaken, jump, noFU	000010	*	******
	Jump&Link	jtype, jalr, bjtaken, jump, noFU	000011	*	******
	Trap	jtype, trap, noFU	111110	*	000000
	RFE	jtype, rfe, noFU	111111	*	******
	Movs2i	rtype, movs2i, FU.alu	000000	*	010000
	Movi2s	rtype, movi2s, FU.alu	000000	*	010001
	FUnimp	iuFOP, noFU	010001	*	00011*
			010001	*	01****
	Illegal (z₀)	ill, noFU	-
	Taken	bjtaken	AEQZ · /IR1[26]
			/AEQZ · IR1[26]
ID2			FCCEQZ · /IR1[26]
			/FCCEQZ · IR1[26]
	Untaken		/taken

State	Instructions	op1.	op2.	op3.	op4.	dest.
ALU	add, sub, test/set, shift	RS1	RS2	-	-	RD
ALUi	addi, subi, test/set immediate	RS1	imm	-	-	RD
Shifti	shift with shift amount	RS1	imm	-	-	RD
Load	load GPR	RS1	-	-	-	RD
Load.s	load single precision FPR	RS1	-	-	-	FD
Load.d	load double precision FPR	RS1	-	-	-	FD
Store	store GPR	RS1	RD	-	-	-
Store.s	store single precision FPR	RS1	FD	-	-	-
Store.d	store double precision FPR	RS1	FD	-	-	-
Faddsub.s	fadd.s, fsub.s	FS1	FS2	RM	MASK	FD
Faddsub.d	fadd.d, fsub.d	FS1	FS2	RM	MASK	FD
Fmul.s	fmul.s	FS1	FS2	RM	MASK	FD
Fmul.d	fmul.d	FS1	FS2	RM	MASK	FD
Fdiv.s	fdiv.s	FS1	FS2	RM	MASK	FD
Fdiv.d	fdiv.d	FS1	FS2	RM	MASK	FD
Fcond.s	fc.cond.s	FS1	FS2	-	MASK	FCC
Fcond.d	fc.cond.d	FS1	FS2	-	MASK	FCC
Fabsneg.s	fabs.s, fneg.s	FS1	-	-	-	FD
Fabsneg.d	fabs.d, fneg.d	FS1	-	-	-	FD
Ff2i	mf2i	FS1	-	-	-	RS2
Fi2f	mi2f	RS2	-	-	-	FS1
FMov.s	mov.s	FS1	-	-	-	FD
FMov.d	mov.d	FS1	-	-	-	FD
FConv.s	cvt.s.d, cvt.s.i, cvt.i.s, cvt.i.d	FS1	-	-	-	FD
FConv.d	cvt.d.i, cvt.d.s	FS1	-	-	-	FD
Branch	beqz, bnez	RS1	-	-	-	-
FBranch	fbeqz, fbnez	FCC	-	-	-	-
JumpReg	jr	RS1	-	-	-	-
Jump&LinkReg	jalr	RS1	-	-	-	R31
Jump	j	-	-	-	-	-
Jump&Link	jal	-	-	-	-	R31
Trap	trap	-	-	-	-	-
RFE	rfe	-	-	-	-	-
Movs2i	movs2i	SA	-	-	-	RD
Movi2s	movi2s	RS1	-	-	-	SA

FU	Purpose
FU[0] = FU.alu	integer instructions, movi2s, movs2i
FU[1] = FU.mem	load, store
FU[2] = FU.fadd	floating point addition and subtraction
FU[3] = FU.fmul	floating point multiplication
FU[4] = FU.fdiv	floating point division
FU[5] = FU.fconv	conversion floating point / integer
FU[6] = FU.ftest	floating point condition tests

Bus	Items	Width	Purpose
op1	l	J+32+1	low part of the first operand
	h	J+32+1	high part of the first operand
	high	1	lowest bit of the register address
op2	l	J+32+1	low part of the second operand
	h	J+32+1	high part of the second operand
	high	1	lowest bit of the register address
op3	l	J+32+1	third operand (always integer)
op4	l	J+32+1	fourth operand (always integer)

Item	Width	Purpose
tag	J	ROB tag of the instruction producing the operand
valid	1	valid = 1 ⇔ operand contains valid data
data	32	actual operand data

RS_i.fill	RS_i.opx.readCDB	RS_i-1.opx.readCDB	new value of RS_i.opx.data
0	0	*	RS_i.opx.data
0	1	*	CDB.data
1	*	0	RS_i-1.opx.data
1	*	1	CDB.data
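
Read as a multiplexer, the table above selects between three data sources. The following sketch assumes that the first column is the fill signal of RS_i and that the next two columns are the readCDB signals of RS_i and RS_i-1; the function and parameter names are illustrative.

```python
def next_opx_data(fill: bool, read_cdb_own: bool, read_cdb_prev: bool,
                  own_data: int, prev_data: int, cdb_data: int) -> int:
    """New value of RS_i.opx.data (behavioral sketch of the table)."""
    if fill:
        # The entry is refilled from the previous reservation station,
        # so the readCDB signal of the previous station decides.
        return cdb_data if read_cdb_prev else prev_data
    # The entry keeps its instruction and may snoop the CDB itself.
    return cdb_data if read_cdb_own else own_data
```

The don't-care entries of the table correspond to the readCDB input that is ignored in each branch.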

RS_{n_j-2}.full	RS_{n_j-1}.full	RS_{n_j-2}.doe	RS_{n_j-1}.doe	RS_{n_j-1}.clear	RS_{n_j-1}.fill	action in RS_{n_j-1}
0	0	0	0	0	1	copy previous RS, which is empty
		0	1
		1	0	not possible
		1	1
0	1	0	0	0	0	no action
		0	1	0	1	copy previous RS, which is empty
		1	0	not possible
		1	1	not possible
1	0	0	0	0	1	copy instruction in previous RS
		0	1	not possible
		1	0	1	1	clear RS, although already empty
		1	1	not possible
1	1	0	0	0	0	no action
		0	1	0	1	replace the current instruction with instruction in previous RS
		1	0	1	0	no action
		1	1	not possible

Inputs		Result
high	db	low part	high part
0	0	data[31:0]	0³²
1	0	data[63:32]	0³²
0	1	data[31:0]	data[63:32]
1	1	not possible
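
The selection in this table can be expressed as a small function. This is only a sketch of the table itself; the name `select_parts` is an assumption, and the combination high = 1, db = 1 is rejected because the table marks it as not possible.

```python
MASK32 = 0xFFFFFFFF


def select_parts(data: int, high: int, db: int):
    """Return (low part, high part) for a 64-bit data word."""
    lo = data & MASK32           # data[31:0]
    hi = (data >> 32) & MASK32   # data[63:32]
    if high and db:
        raise ValueError("high = 1 together with db = 1 is not possible")
    if db:
        return lo, hi            # double precision: both halves are used
    if high:
        return hi, 0             # single precision taken from the upper half
    return lo, 0                 # single precision taken from the lower half
```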

op[4]	op[3]	op[2]	op[1]	op[0]	Function
0	0	0	0	0	a << b
0	0	0	1	0	a >> b
0	0	0	1	1	a >> b (arithmetic)
1	0	0	0	0	a+b with test of overflow
1	0	0	0	1	a+b without test of overflow
1	0	0	1	0	a-b with test of overflow
1	0	0	1	1	a-b without test of overflow
1	0	1	0	0	a ∧ b
1	0	1	0	1	a ∨ b
1	0	1	1	0	a ⊕ b
1	0	1	1	1	b[0:15] 0¹⁶
1	1	0	0	1	a > b ? 1 : 0
1	1	0	1	0	a = b ? 1 : 0
1	1	0	1	1	a ≥ b ? 1 : 0
1	1	1	0	0	a < b ? 1 : 0
1	1	1	0	1	a ≠ b ? 1 : 0
1	1	1	1	0	a ≤ b ? 1 : 0

Purpose	Latency	# RS
floating point addition and subtraction	5	2
floating point multiplication	5	2
floating point division	15	1
conversion floating point / integer	4	1
floating point condition tests	1	1

j ∈ (R(t) ∩ M(t))	⇔	(j ∈ R(t)) ∧ (j ∈ M(t))
(R(t) ∩ M(t))_j	=	R_j(t) ∧ M_j(t)
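
When the sets R(t) and M(t) are stored as bit vectors, the identity above says that set intersection is a bitwise AND. A minimal sketch, with Python integers standing in for the bit vectors and hypothetical helper names:

```python
def member(vec: int, j: int) -> bool:
    """j is an element of the set iff bit j of the vector is set."""
    return bool((vec >> j) & 1)


def intersect(r: int, m: int) -> int:
    """Bit j of the result is R_j AND M_j."""
    return r & m


r_t = 0b101101
m_t = 0b100111
both = intersect(r_t, m_t)
```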

Name	Width	ROB	Purpose
valid	1	ROB1	valid = 1 ⇔ data contains a valid value
data	64	ROB1	result data
dmal	1	ROB1	misaligned data memory access
Dpf	1	ROB1	data memory page fault
ovf	1	ROB1	overflow in ALU instruction
IEEEf	5	ROB1	IEEE flags (only used by floating point instr.)
EData	32	ROB1	exception data
	Σ 105
ill	1	ROB2	illegal instruction
imal	1	ROB2	misaligned instruction memory access
Ipf	1	ROB2	instruction memory page fault
trap	1	ROB2	trap = 1 ⇔ instruction is a trap instruction
uFOP	1	ROB2	unimplemented floating point instruction
dest	4	ROB2	destination register address
db	1	ROB2	db = 1 ⇔ result has double precision
fpr	1	ROB2	fpr = 1 ⇔ dest is a floating point register
spr	1	ROB2	spr = 1 ⇔ dest is a special purpose register
gpr	1	ROB2	gpr = 1 ⇔ dest is a general purpose register
PC	32	ROB2	PC of the instruction
target	32	ROB2	target / fallthrough address
bj	1	ROB2	bj = 1 ⇔ instruction is a branch/jump
	Σ 78

Port	Use	Purpose
1	read only ROB1	Forwarding of low part of operand 1
2	read only ROB1	Forwarding of high part of operand 1
3	read only ROB1	Forwarding of low part of operand 2
4	read only ROB1	Forwarding of high part of operand 2
5	read only ROB1	Forwarding of operand 3
6	read only ROB1	Forwarding of operand 4
7	read only ROB1, ROB2	Retire
8	write only ROB1, ROB2	Issue (destination)
9	write only ROB1	Completion

Interrupt	Symbol	Priority	Resume	Maskable	External
reset	reset	0	abort	no	yes
illegal instruction	ill	1	abort	no
misaligned access	mal	2
page fault IM	Ipf	3	repeat
page fault DM	Dpf	4
trap	trap	5	continue
FXU overflow	ovf	6	continue	yes	no
FPU overflow	fOVF	7
FPU underflow	fUNF	8	abort/
FPU inexact result	fINX	9	continue
FPU divide by zero	fDBZ	10
FPU invalid operation	fINV	11
FPU unimplemented	uFOP	12	continue	no
external I/O	ex_j	12+j	continue	yes	yes
