The overall cost is calculated by a program, since solving the cost recurrences is beyond the scope of this thesis. See appendix B for detailed instructions. The cost values in table 5.2 are calculated for a DLX core with a Tomasulo scheduler, 16 reorder buffer entries, and no data or instruction cache. The following sections present the hardware cost of the main circuits in this thesis.
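| Gate | Cost symbol | Cost | Delay symbol | Delay |
|---|---|---|---|---|
| Inverter | Cinv | 1 | Dinv | 1 |
| NAND | Cnand | 2 | Dnand | 1 |
| NOR | Cnor | 2 | Dnor | 1 |
| AND | Cand | 2 | Dand | 2 |
| OR | Cor | 2 | Dor | 2 |
| XOR | Cxor | 4 | Dxor | 2 |
| XNOR | Cxnor | 4 | Dxnor | 2 |
| Multiplexer | Cmux | 3 | Dmux | 2 |
| Tristate Driver | Cdriv | 5 | Ddriv | 2 |
| Flip-Flop | Cff | 8 | Dff | 4 |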
| Circuit | Cost | # | Total | % | Figure |
|---|---|---|---|---|---|
| two float RS, without FU | 9839 | 2 | 19678 | 8.3 | 3.13 |
| one float RS, without FU | 5600 | 3 | 16800 | 7.1 | 3.13 |
| Floating point adder | 23735 | 1 | 23735 | 10.1 | |
| Floating point mul/div unit | 47557 | 1 | 47557 | 20.2 | |
| Floating point converter | 15926 | 1 | 15926 | 6.7 | |
| Floating point transfer | 2209 | 1 | 2209 | 0.9 | |
| four integer RS, without FU | 7201 | 1 | 7201 | 3.1 | 3.13 |
| Integer ALU | 3693 | 1 | 3693 | 1.6 | A.4 |
| Data memory environment (four RS) | 37846 | 1 | 37846 | 16.0 | 4.1 |
| Instruction memory environment | 70 | 1 | 70 | 0.0 | 3.5 |
| Instruction register environment | 158 | 1 | 158 | 0.1 | 3.6 |
| PC environment | 2252 | 1 | 2252 | 1.0 | 3.3 |
| CDB control environment | 196 | 1 | 196 | 0.1 | 3.22 |
| Decode / issue environment | 3742 | 1 | 3742 | 1.6 | 3.9 |
| Reorder buffer environment | 19807 | 1 | 19807 | 8.4 | 3.23 |
| Register files | 19545 | 1 | 19545 | 8.3 | 3.24 |
| Producer tables | 15574 | 1 | 15574 | 6.6 | 3.28 |
| Total | | | 235989 | 100.0 | |
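The arithmetic behind the cost table can be checked with a few lines of Python. This is only a minimal sketch, not the cost program described in appendix B: the names `CIRCUITS`, `total_cost`, and `cost_shares` are hypothetical, and the per-circuit costs are copied from the table rather than derived from the gate costs.

```python
# (circuit, cost per instance, number of instances), copied from the cost table.
CIRCUITS = [
    ("two float RS, without FU", 9839, 2),
    ("one float RS, without FU", 5600, 3),
    ("Floating point adder", 23735, 1),
    ("Floating point mul/div unit", 47557, 1),
    ("Floating point converter", 15926, 1),
    ("Floating point transfer", 2209, 1),
    ("four integer RS, without FU", 7201, 1),
    ("Integer ALU", 3693, 1),
    ("Data memory environment (four RS)", 37846, 1),
    ("Instruction memory environment", 70, 1),
    ("Instruction register environment", 158, 1),
    ("PC environment", 2252, 1),
    ("CDB control environment", 196, 1),
    ("Decode / issue environment", 3742, 1),
    ("Reorder buffer environment", 19807, 1),
    ("Register files", 19545, 1),
    ("Producer tables", 15574, 1),
]


def total_cost(circuits):
    """Sum of cost * count over all circuits."""
    return sum(cost * count for _, cost, count in circuits)


def cost_shares(circuits):
    """Share of each circuit in the overall cost, in percent."""
    total = total_cost(circuits)
    return {name: 100.0 * cost * count / total for name, cost, count in circuits}


if __name__ == "__main__":
    print("total:", total_cost(CIRCUITS))
    for name, share in cost_shares(CIRCUITS).items():
        print(f"{name:36s} {share:5.1f} %")
```

The shares printed by this sketch are rounded to one decimal place, which is assumed to match the rounding used in the table.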

| Symbol | Meaning | ID1 | ID2 |
|---|---|---|---|
| s | # input signals | 13 | 3 |
| g | # output signals | 45 | 1 |
| k | # states | 37 | 2 |
| z | z = ⌈log k⌉ | 6 | 1 |
| nmax | maximal frequency of a control signal | 21 | 1 |
| nsum | accumulated frequency of all control signals | 196 | 1 |
| #M | # monomials, nontrivial | 39 | 4 |
| lmax | length of longest monomial | 13 | 2 |
| lsum | accumulated length of all monomials | 340 | 8 |
| faninmax | maximal fanin of n ≠ z0 | 2 | 8 |
| faninsum | accumulated fanin | 38 | 8 |
| # RS / FU | CPI | Speedup | Cost without cache | % | Cost with cache | % |
|---|---|---|---|---|---|---|
| 1 | 1.6602 | 0.0% | 198076 | 100.0% | 573055 | 100.0% |
| 2 | 1.5644 | 6.1% | 229080 | 115.7% | 604059 | 105.4% |
| 4 | 1.5161 | 9.5% | 291116 | 147.0% | 666095 | 116.2% |
| 8 | 1.4720 | 12.7% | 415216 | 209.6% | 790195 | 137.9% |
| var. | ~1.5 | 10.7% | 235989 | 119.1% | 610968 | 106.6% |
Table 5.4: Variations of the number of reservation stations. The last line lists the values for the configuration with a variable number of reservation stations depending on the function unit.
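The speedup and relative cost columns of Table 5.4 can be recomputed from the CPI and cost values. The sketch below assumes that speedup is defined as CPI of the baseline divided by CPI of the configuration, minus one, and that relative cost is the plain ratio to the baseline; both definitions are inferred from the numbers and are not stated in the text. The function names are hypothetical. Because the table lists rounded CPI values, recomputed percentages may differ from the table in the last digit.

```python
# (reservation stations per FU, CPI, cost without cache), copied from Table 5.4.
CONFIGS = [
    (1, 1.6602, 198076),
    (2, 1.5644, 229080),
    (4, 1.5161, 291116),
    (8, 1.4720, 415216),
]


def speedup_percent(cpi_base, cpi):
    """Assumed definition: how much faster a configuration is than the baseline."""
    return 100.0 * (cpi_base / cpi - 1.0)


def relative_cost_percent(cost_base, cost):
    """Cost of a configuration as a percentage of the baseline cost."""
    return 100.0 * cost / cost_base


if __name__ == "__main__":
    _, base_cpi, base_cost = CONFIGS[0]
    for rs, cpi, cost in CONFIGS:
        print(rs,
              f"{speedup_percent(base_cpi, cpi):.1f}%",
              f"{relative_cost_percent(base_cost, cost):.1f}%")
```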
$$C_{ram}(A,n,r,w) = C_{ram}(A,n) \cdot \left(0.4 + 0.6 \cdot \frac{2w+r}{2}\right)$$

$$D_{ram}(A,n,r,w) = D_{ram}(A,n) \cdot \left(0.5 + 0.5 \cdot \frac{2w+r}{2}\right)$$
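The two scaling formulas can be written as small functions. This is a minimal sketch under stated assumptions: r and w are read here as the number of read and write ports, which the formulas suggest but the surrounding text does not state; the base values Cram(A,n) and Dram(A,n) are passed in as plain numbers; and the function names are hypothetical.

```python
def multiport_ram_cost(base_cost, r, w):
    """Scale the base RAM cost Cram(A,n) by the port-dependent factor."""
    return base_cost * (0.4 + 0.6 * (2 * w + r) / 2)


def multiport_ram_delay(base_delay, r, w):
    """Scale the base RAM delay Dram(A,n) by the port-dependent factor."""
    return base_delay * (0.5 + 0.5 * (2 * w + r) / 2)


if __name__ == "__main__":
    # Hypothetical base values, chosen only to make the factors visible.
    base_cost, base_delay = 1000, 10
    for r, w in [(2, 0), (1, 1), (2, 1), (4, 2)]:
        print(r, w,
              multiport_ram_cost(base_cost, r, w),
              multiport_ram_delay(base_delay, r, w))
```

Both factors equal 1 when 2w + r = 2, so under these formulas the base values correspond to a RAM with either two of the r-type ports or one of the w-type ports.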
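| ROB entries | Tag bits | CPI | Speedup | Cost without cache | % | Cost with cache | % |
|---|---|---|---|---|---|---|---|
| 16 | 4 | 1.4720 | 0.0% | 235989 | 100.0% | 610968 | 100.0% |
| 32 | 5 | 1.5186 | -3.1% | 254008 | 107.6% | 628987 | 102.9% |
| 64 | 6 | 1.4365 | 2.4% | 288177 | 122.1% | 663156 | 108.5% |
| 128 | 7 | 1.4639 | 0.6% | 354667 | 150.3% | 729646 | 119.4% |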
| | Pipelined | RSR+ROB | Tomasulo |
|---|---|---|---|
| CPU core only | 108949 (100%) | 169701 (155%) | 235989 (216%) |
| with 16 kb cache | 483928 (100%) | 544680 (112%) | 610968 (126%) |
| CPI / speedup | 2.12 (0%) | 1.73 (22%) | 1.47 (44%) |
For this calculation, it is assumed that the memory interfaces (instruction and data memory) do not increase the delay. In case of slow memory, it is assumed that appropriate caches are added.
$$Q_q = \frac{1}{\mathit{CPI}^{1-q} \cdot C^q}$$

For this definition of quality and a fixed quality parameter q, a design A is better than a design B iff $Q_q^A > Q_q^B$, i.e., higher values of $Q_q$ denote better designs.
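To illustrate how the quality parameter q shifts the ranking, the following sketch evaluates the quality formula for the three core-only designs, using the cost and CPI values from the comparison table above. The helper `break_even_q` is not from the thesis: it solves the equation Q_q(A) = Q_q(B) for q, assuming design B has both a lower CPI and a higher cost than design A. All function and variable names are hypothetical.

```python
import math

# (CPI, cost of CPU core only), copied from the design comparison table.
DESIGNS = {
    "Pipelined": (2.12, 108949),
    "RSR+ROB": (1.73, 169701),
    "Tomasulo": (1.47, 235989),
}


def quality(cpi, cost, q):
    """Q_q = 1 / (CPI^(1-q) * C^q); higher is better."""
    return 1.0 / (cpi ** (1.0 - q) * cost ** q)


def break_even_q(cpi_a, cost_a, cpi_b, cost_b):
    """q at which designs A and B have equal quality.

    Derived from (1-q) * ln(CPI_a / CPI_b) = q * ln(C_b / C_a).
    Assumes B is faster (lower CPI) and more expensive than A.
    """
    perf_gain = math.log(cpi_a / cpi_b)
    cost_penalty = math.log(cost_b / cost_a)
    return perf_gain / (perf_gain + cost_penalty)


def best_design(q):
    """Name of the design with the highest quality for a given q."""
    return max(DESIGNS, key=lambda name: quality(*DESIGNS[name], q))


if __name__ == "__main__":
    for q in (0.0, 0.25, 0.5, 1.0):
        print(q, best_design(q))
    print(break_even_q(*DESIGNS["Pipelined"], *DESIGNS["Tomasulo"]))
```

With q = 0 only performance counts and quality reduces to 1/CPI, while with q = 1 only cost counts, so the ranking of the designs reverses somewhere in between.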