Introduction

The performance of today's microprocessors is astonishing. Beneath the progress in wafer technology, a big contribution to the improvements achieved in the past years was made by developing sophisticated scheduling algorithms. One of the major scheduling algorithms used in recent CPUs was specified long ago in 1967 by Robert M. Tomasulo [1]. However, up to now concrete data on the impact of the algorithm on hardware cost and cycle time has been missing.

Thus, this thesis gives a detailed implementation of the Tomasulo scheduling algorithm for the DLX RISC architecture [2]. The design is based on a machine presented in [3] and realizes full support for precise interrupts with a reorder buffer [4]. Cost and cycle time are calculated and evaluated with a formal model presented in [5]. The results are compared to other DLX implementations.

Results

The Tomasulo scheduling algorithm is one of the most competitive scheduling algorithms. It provides low CPI rates down to 1.1 which is shown by simulations on common benchmarks in [6,7]. This thesis shows that adding a Tomasulo scheduler does not have any impact on the cycle time of the CPU design.

The Tomasulo scheduling algorithm with precise interrupts is known to be expensive regarding hardware cost. A complete CPU core design counts about 236,000 gate equivalents, which is about two times as much as is needed by a pipelined design with equal function units. Compared to the total costs of a CPU design (including the first level cache), this is just an increase of 26 percent at a 44 percent higher performance.

Outline

Chapter 2 describes some basic concepts, like the hardware model and gives a rough overview of the design. It also includes a terse introduction to the Tomasulo scheduling algorithm itself. Chapter 3 presents all the implementation details on gate level other than the memory system, which is presented separately in chapter 4. In chapter 5, the analysis of the cost and cycle time of the design is carried out. The results are compared to other DLX implementations in order to evaluate the overall design quality impact of different scheduling algorithms. Chapter 6 contains the correctness proof for the hardware.

References

[1]	R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. In IBM Journal of Research and Development, volume 11 (1), pages 25 33. IBM, 1967.
[2]	J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantita-tive Approach. Morgan Kaufmann Publishers, INC., San Mateo, CA, 2nd edition, 1996.
[3]	Holger Leister. Quantitative Analysis of Precise Interrupt Mechanism for Processors with Out-Of-Order Execution. PhD thesis, Universität des Saarlandes, Fachbereich 14 Informatik, 1998. Preliminary Version, 03/1998.
[4]	J.E. Smith and A.R. Pleszkun. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers, 37(5):562-573, 1988.
[5]	S.M. Müller and W.J. Paul. The Complexity of Simple Computer Architectures. Lecture Notes in Computer Science 995. Springer, 1995.
[6]	Nikolaus D. Gerteis. Die Auswirkung von Mechanismen für die präzise Interruptbehandlung auf den Cyclecount von RISC-Prozessoren. Master's thesis, Universität des Saarlandes, Fachbereich 14 Informatik, 1998.
[7]	Peter Dell. Die Auswirkung von Mechanismen zur out-of-order Ausführung auf den Cyclecount von RISC-Architekturen. Master's thesis, Universität des Saarlandes, FB. Informatik, 1998.