Chapter 6 Correctness
The correctness proof of the Tomasulo scheduling algorithm with reorder
buffer itself is already in [Ger98]. It is based on a proof for the
Tomasulo scheduler without reorder buffer in [Mül97a]. The proof
presented here uses a slightly different notation. The correctness proof has
two parts: The first part shows that data consistency is maintained. The
second part proves that the algorithm terminates, i.e., the hardware is
deadlock-free.
6.1 Data Consistency
6.1.1 Definitions
Let I1 to In be a sequence of n instructions. Data consistency
intuitively means that the results generated by any given instruction Ii
in the program conform to the semantics of an abstract DLX machine. These
semantics are defined as transition rules on an abstract configuration set
C(i) of the machine. This configuration includes all DLX registers, which
are R0(i) to R31(i), SPR, FPR, and PC(i), and the rest of the
program to be executed. A shorthand for all register files is RF[](i). The
start configuration C(0) is the configuration after the reset.
This abstract DLX machine processes one instruction with each transition.
Let the source registers of instruction Ii be S1(i) to Ss(i)(i)
with values s1(i) to ss(i)(i) and the destination registers
D1(i) to Dd(i)(i) with values d1(i) to dd(i)(i).
We use s(i) source and d(i) destination registers to handle double
precision floating point instructions without disturbing distinction of
cases. For example, the instruction
FPR0:=FPR2+FPR4
has six source operands (FGR2 to FGR5 and RM and MASK) and three
destination operands (FGR0, FGR1, and IEEEf). Let opi be the
operation of instruction Ii, this is a double precision floating point
addition in the example.
Now, Ii can be specified as
Ii: (D1(i),..,Dd(i)(i)) = opi(S1(i),..,Ss(i)(i))
with values
Ii: (d1(i),..,dd(i)(i)) =
opi(s1(i),..,ss(i)(i))
As mentioned above, the abstract DLX machine processes instruction Ii in
transition i, which results in configuration C(i). Registers not written
by instruction Ii are passed unmodified from configuration C(i-1). Let
last(r,i) be index of the last instruction prior Ii which modified
register r:
last(r,i) = max{j<i | Ij has destination register r }
This allows an easy criterion for data consistency for any given
(non-abstract) machine.
Criterion 1 For any given source register, the value
of this register must be the result of the last instruction writing it:
sx(i) = dy(last(Sx(i),i))
" x Î {1,..,s(i)}, Dy(i)=Sx(i)
There are some exceptions from this rule, e.g., the register R0, which
are left out for sake of simplicity. Furthermore, if there is no such
instruction, this value is undefined, therefore, it is assumed that all
registers, which are used as source register, are initialized with proper
values by the program.
The rest of the section will prove that this condition holds for the machine
presented in chapter 3. The denominators of the abstract DLX
machine for the registers and the values are now used for the Tomasulo DLX
configuration. This is done in analogy to the proof in [Mül97a] with
the help of four invariants.
Invariant 1 During the issue of instruction Ii,
source operand Sx(i) is in the register file, as indicated by
RF[r].valid=1. In this case, the data item of the register holds the
desired value, i.e., it holds the result of the last instruction writing
RF[r].
Invariant 2 During the issue of instruction Ii,
source operand r:=Sx(i) is, as indicated by RF[r].valid=0, the result of
an previous instruction Ij, which has not yet retired. In this case, the
tag item of the register holds Ij.tag, i.e., the tag of the instruction
Ij. This instruction is the last instruction writing RF[r].
Invariant 3 Let instruction Ii be an instruction in
reservation station RSl. If there is an operand x in this reservation
station which is not yet valid (RSl.opx.valid=0), there must be a previous
instruction Ij, which produces the value for the operand and has not yet
completed. The tag of this instruction is the tag in RSl.opx.tag.
Invariant 4 Let invariant 2 or 3 apply for Ii and source
operand r:=Sx(i) with tag Sx(i).tag. If the CDB is valid and if the
CDB tag is equal to the tag of the operand, the result of the last
instruction writing the operand is on the CDB. If the ROB entry pointed to
by the tag of the operand is valid, the result of the last instruction
writing the operand is in this ROB entry.
With the help of these invariants, the proof of the data consistency is done
by induction over n. For n=0, the claim is obvious. The induction for
n>0 requires a distinction of two cases. Let instruction Ii read
source operand Sx(i). This is possible in two different phases.
- Let instruction Ii read register r:=Sx(i) during
issue. In dependence of the value of RF[r].valid, either invariant 1 or
invariant 2 applies. If RF[Sx(i)].valid is set, the claim is an
implication of invariant 1, since the operand is in the register file.
If not so, invariant 2 states that RF[r] contains the tag of the
instruction Ij which produces the result. As the operand of the
instruction is already available during issue, it must be either in the ROB
or on the CDB. In both cases, invariant 4 applies. Thus, the result in the
ROB or on the CDB is d(last(r,i)). The instruction Ij
had correct operands because of j<i.
- Let instruction Ii read register r:=Sx(i) while in
reservation station RSl. This only happens if RSl.opx.valid is not
set. In this case, invariant 3 states that the tag in RSl.opx.tag is
the tag of the instruction Ij which produces d(last(r,i)).
This tag must be equal to the tag on the CDB, thus invariant four applies.
Ii therefore reads the result of Ij. This is the correct result
because of j<i.
6.2 Termination
6.2.1 Definitions
The termination proof will show that an instruction sequence I1 to In
is processed in a finite amount of clock cycles. The following shorthands
are used: fetch(i), issue(i), disp(i), compl(i) and term(i) denote
the cycles in with instruction Ii is fetched, issued, dispatched,
completed and terminated, respectively.
6.2.2 Termination
Lemma 1 All instructions are issued in program order and
terminate in program order, i.e., issue(i)<issue(j) and
term(i)<term(j) for any i<j.
In-order termination is a conclusion of the fact that an instruction
terminates when leaving the ROB. The ROB is a fifo queue and is filled in
program order. Consequently, to prove termination, it is sufficient to prove
that a finite a exists with term(i)+a ³
term(j) for any i<j, i.e., to find an upper bound for the number of
cycles of an instruction. A weak upper bound for a can be determined
by summing up the maximum time which an instruction spends in each stage
after all previous instructions terminated:
a £ a0 + ... + a4
This bound is proved by induction over n. The induction hypothesis is
that instruction Ii obeys to this bound.
Fetch After all previous instructions terminated, instruction Ii
can be fetched if there is no stall from the instruction memory. If
lmem<¥ is the maximum latency of the memory, a0 = lmem.
Issue Instruction Ii can be issued iff there is no issue stall (the
conditions for an issue stall are in section 3.5.6, page
??). This is obviously true one cycle after all previous
instructions terminated and if there is no stall from the instruction
memory. Since lmem is the maximum latency of the memory, a1 = lmem+1.
Dispatch Instruction Ii can be dispatched iff all its operands are
valid and if the function unit is able to accept data. Obviously, the
operands are valid after all previous instructions terminated. However, the
function unit still might be blocked by instructions after Ii, i.e,
instructions Ij with j>i. Let imax be the maximum number of
instructions the function unit can hold (including the producer pipeline
stage), and l the maximum latency of the function unit. For iterative
function units, imax · l is a weak bound for the number of cycles
until the function unit is empty. This bound is valid for pipelined function
units, too. However, before an instruction can leave the function unit, it
is necessary to wait for the CDB. Since the CDB is allocated round robin,
this takes n cycles at most, with n being the number of function units.
Thus, it takes a2=imax · l · n cycles at most until the
function unit is empty. As Ii is the oldest valid instruction in the
reservation stations (all previous already terminated), Ii is dispatched
afterwards.
Completion As shown above, an instruction Ii in a function unit
leaves it (completes) after at most a3=imax · l · n cycles.
Termination A valid (completed) instruction in the ROB terminates one
cycle after all previous instructions terminated. Thus, a4 is one
cycle.