Previous Contents Next

Chapter 5   Cost and Cycle Time

5.1   Hardware Cost

In spite of progress in miniaturization, hardware cost, e.g., transistor count, is still a crucial matter in CPU design. The hardware cost model used in the following sections is presented in [MP95, KP95] at length. The following sections give just a short summary.

The model does not take wiring into account; only gates are relevant for the hardware cost calculation. Since the Tomasulo algorithm requires several large bus structures, further research on this topic might be of interest. Table 5.1 lists the cost and delay of the basic gates. The values are normalized to the cost and the delay of an inverter.


Gate Cost Delay
Inverter Cinv 1 Dinv 1
NAND Cnand 2 Dnand 1
NOR Cnor 2 Dnor 1
AND Cand 2 Dand 2
OR Cor 2 Dor 2
XOR Cxor 4 Dxor 2
XNOR Cxnor 4 Dxnor 2
Multiplexer Cmux 3 Dmux 2
Tristate Driver Cdriv 5 Ddriv 2
Flip-Flop Cff 8 Dff 4

Table 5.1: Cost and delay of the basic gates


The overall cost calculation is done by a program, since resolving recurrences is beyond the interest of this thesis. See appendix B for detailed instructions. The detailed cost values in table 5.2 are calculated for a DLX core with Tomasulo scheduler without data or instruction cache, and 16 reorder buffer entries. The following sections expose the hardware cost of main circuits in this thesis.


Circuit Cost # Total % Figures
two float RS, without FU 9839 2 19678 8.3 3.13 p.??
one float RS, without FU 5600 3 16800 7.1 3.13 p.??
Floating point adder 23735 1 23735 10.1  
Floating point mul/div unit 47557 1 47557 20.2  
Floating point converter 15926 1 15926 6.7  
Floating point transfer 2209 1 2209 0.9  
four integer RS, without FU 7201 1 7201 3.1 3.13 p.??
Integer ALU 3693 1 3693 1.6 A.4 p.??
Data memory environment (four RS) 37846 1 37846 16.0 4.1 p.??
Instruction memory environment 70 1 70 0.0 3.5 p.??
Instruction register environment 158 1 158 0.1 3.6 p.??
PC environment 2252 1 2252 1.0 3.3 p.??
CDB control environment 196 1 196 0.1 3.22 p.??
Decode / issue environment 3742 1 3742 1.6 3.9 p.??
Reorder buffer environment 19807 1 19807 8.4 3.23 p.??
Register files 19545 1 19545 8.3 3.24 p.??
Producer tables 15574 1 15574 6.6 3.28 p.??
Total     235989 100.0  

Table 5.2: Cost of the core components


5.1.1   Cost Calculation of the Control Automaton

The cost of the control boolean equations in the decode/issue environment (page ??) are estimated with a model found in [MP95]. The automaton in this thesis does not store the state. Thus, the circuits for the next state calculation and the output signals calculation are sufficient.

The calculation of cost and delay depends on a set of parameters. This set is given in table 5.3. The cost of the automata is calculated in three steps: CM denotes the cost of the calculation of the monomials. With the result of this calculation, the next state is calculated at cost CN. The state is not stored but depending on it, the output signals are calculated at cost CO.



The cost calculation program uses these formulae in order to calculate the cost of the ID1 and ID2 automata in the opgen() function.


Symbol Meaning ID1 ID2
s # input signals 13 3
g # output signals 45 1
k # states 37 2
z z=é log k ù 6 1
nmax maximal frequency of a control signal 21 1
nsum accumulated frequency of all control signals 196 1
#M # monomials, nontrivial 39 4
lmax length of longest monomial 13 2
lsum accumulated length of all monomials 340 8
faninmax maximal fanin of n ¹ z0 2 8
faninsum accumulated fanin 38 8

Table 5.3: Parameters for the control automata ID1 and ID2


5.1.2   Reservation Stations and Function Units

With more than 30% of the total cost of the core, the reservation stations are the most expensive data structure in the design, mainly because of the large number of registers and equality testers for each of the six operands. In the data memory reservation stations, the address adders have the biggest cost share. The adders calculate the sum of the memory address operand and the immediate constant co1. The immediate constant is 16 bits wide with sign extension to 32 bits. Thus, the total cost of the data memory reservation station can benefit from a specialized adder, which assumes that the upper 16 input bits of one operand are equal.

Since [Ger98, Del98] state that many reservation stations have a low use in common benchmarks, the total cost of the reservation stations may be reduced by removing rarely used reservation stations without or with slight impact on the performance. Four reservation station entries are therefore used for the ALU and the data memory and only two for each floating point function unit. Table 5.4 lists cost values for the different assignments. The last entry is the hardware cost with variable number of reservation stations for each function unit as proposed in chapter 3.

In order to save hardware cost within the data memory environment, processing the load/store instructions strictly in-order could save all address adders up to one and save all the address comparators, since it would be no longer necessary to compare the addresses in the reservation stations.

Another approach to save hardware cost of the reservation stations is to remove operands not worth forwarding, i.e., RM and MASK. This could save up to one third of the total cost of the reservation stations. Section 3.7.2 (page ??) discusses the consequences.

The cost values for the floating point units are taken from [Lei98]. The cost of the floating point converter is estimated. The cost of the ALU environment is estimated from values found in [MP95] since this environment is almost literally copied. In order to save cycle time, the expensive variant with conditional sum adder is used, since the ALU with carry look-ahead adder would be on the critical path.


# RS / FU CPI / speedup cost without cache cost with cache
1 1.6602 0.0% 198076 100.0% 573055 100.0%
2 1.5644 6.1% 229080 115.7% 604059 105.4%
4 1.5161 9.5% 291116 147.0% 666095 116.2%
8 1.4720 12.7% 415216 209.6% 790195 137.9%
var. ~1.5 10.7% 235989 119.1% 610968 106.6%

Table 5.4: Variations of the number of reservation stations. The last line lists the values for the configuration with a variable number of reservation stations depending on the function unit.


5.1.3   Hardware Cost of the ROB and the Register Files

One disadvantage of the reorder buffer is that it requires many RAM ports for forwarding. Thus, the cost impact of adding a large amount of ports to a RAM is significant for the total hardware cost. The cost and delay of a n-bit on chip RAM with A addresses and r read and w write ports is estimated as follows in analogy to [MP95]:

Cram(A,n,r,w) = Cram(A,n) · (0.4+ 0.6 · (2w+r)/2)
Dram(A,n,r,w) = Dram(A,n) · (0.5+ 0.5 · (2w+r)/2)

The ROB has many components and reorder buffer with more entries require more tag bits, which increases the cost of the reservation stations. It is therefore advisable to keep the number of entries small. Table 5.5 shows the cost and CPI values for four different ROB sizes. The simulations in [Ger98] did not show that the CPI decreases at bigger ROB sizes. A ROB size of 16 entries therefore seems to be most cost efficient for the given design.

The GPR and FPR is implemented as RAM to save hardware cost. In order to allow cycles from Din to Dout, the SPR and all producer tables are made of register based RAM, which is much more expensive. However, there are only nine special purpose registers, which keeps the cost of the SPR low.

According to the results in table 5.2, reorder buffer, register files, and producer tables together have about 23 percent of the total core cost.


ROB entries tag bits CPI / speedup cost without cache cost with cache
16 4 1.4720 0.0% 235989 100.0% 610968 100.0%
32 5 1.5186 -3.1% 254008 107.6% 628987 102.9%
64 6 1.4365 2.4% 288177 122.1% 663156 108.5%
128 7 1.4639 0.6% 354667 150.3% 729646 119.4%

Table 5.5: Effect of ROB size on hardware cost and CPI


5.1.4   Caches

The overall cost of the Tomasulo core of about 236k gate equivalents seems to be a huge augmentation compared to the pipelined core (table 5.6). However, any modern CPU design provides data and instruction caches to speed up access to the memory system. These caches are rather large and expensive compared to the actual core. The values in table 5.6 are based on a 16 KB direct mapped cache; they are taken from [MP98]. Comparing both pipelined and Tomasulo architectures with caches shows the cost efficiency of Tomasulo scheduling. At 26 percent higher cost, the design performs 44 percent better. In order to get realistic values, all CPI rates have been simulated with caches.


  Pipelined RSR+ROB Tomasulo
CPU core only 108949 100% 169701 155% 235989 216%
with 16 kb cache 483928 100% 544680 112% 610968 126%
CPI/speedup 2.12 0% 1.73 22% 1.47 44%

Table 5.6: Cost of core and complete CPU and CPI rates


5.2   Cycle Time

In the given hardware model, the maximum clock frequency of a CPU is determined by the accumulated gate delay of the longest path in the design. This path is determined by a special program. In this program, all data paths are modelled by C++ data structures. Details on this program are contained in appendix B.

For the given design without floating point units, the critical path is determined by the program as follows: Figure 5.1 shows the path. It starts in the IR1 register of the decode/issue environment (figure 3.9, page ??). The instruction word is processed by the control automaton ID1 (opgen). After that, the generated values are used to calculate op1.A, the register address of the first operand (figure 3.10, page ??). With this address as index, the tag bit of the register is looked up in the producer table. This tag is used as address for the ROB RAM. The output of the ROB is used as input for the top MIX of the datagen circuit (figure 3.11, page ??). The operand determined by datagen is used as input for address computation on branches in the PC environment. The path ends in the PC0 register and has an accumulated delay of 106, which is comparable to other designs, e.g., the pipelined DLX presented in [MP95].




Figure 5.1: The longest path


For this calculation, it is assumed that the memory interfaces (instruction and data memory) do not increase the delay. In case of slow memory, it is assumed that appropriate caches are added.

The delay can be easily reduced down to 66 by replacing the ROB RAMs with a register based RAM. However, using a register based RAM of this size increases the hardware cost of the core by 34 percent. Adding the FPU introduces a much longer data path with a length of 137 gate delays within the floating point function units [Lei98]. The overall clock rate of the design is therefore determined by the FPU and not by the scheduler used. However, the design of the FPUs leaves much room for optimization.



Table 5.7: Longest path of the design


5.3   Quality Survey and Comparison

5.3.1   Introduction on Quality Metrics

The quality of a given design depends on the cost and on the performance of the design [Grü94, MP95]. These two values have different weights for certain tasks. For a given task, the hardware cost might be more important than the performance or vice versa. In order to quantify this weight, a quality parameter q Î [0,1] is introduced. With q=1, only cost is relevant for the quality. With q=0, only performance counts and with q=0.5, both parameters have equal weight. Realistic quality weights for cost and performance are values between 0.2 and 0.5.

In order to compare two designs given such a weight, the CPI and the cost C in gate equivalents is used for the quality function. It is assumed that both designs have equal or similar cycle times.

Qq= 1/ CPI1-q · Cq
For this definition of quality and regarding a fixed quality parameter q, a design A is better than a design B iff QqA > QqB, i.e., higher values of Qq denote better designs.

5.3.2   Comparison

Two other DLX designs are used as reference: The pipelined DLX with precise interrupts and floating point units presented in [MP95, MP98] and a DLX with result shift register (RSR) and reorder buffer presented in [Lei98]. Table 5.6 contains the cost and CPI values for all three designs. The CPI values are taken from [Ger98, Del98].

Since all designs use similar floating point units, all share the same path in the rounding unit of 137 gate delays. The RSR+ROB DLX in [Lei98] has a critical path with 164 gate delays. Since this path can be easily optimized, the cycle time of the three designs can be assumed to be equal.

Figure 5.2 is a plot of the quality function of these designs in a logarithmic scale. As expected, the pipelined design is always the best design if hardware cost is crucial. The Tomasulo scheduler wins in spite of the high cost if performance counts. The cost and performance of the RSR+ROB design is exactly between both other designs. All three designs have an interval of q, in which their quality Qq is optimal. The points of equal quality (i.e., the value of q with QqA=QqB), are marked with two lines in figure 5.2, to show the ranges of q, in which the single designs are optimal.



Figure 5.2: Quality for Pipelined DLX, RSR+ROB, and Tomasulo+ROB



Previous Contents Next