Chapter 4 Memory System

4.1 Overview of the Data Memory System

The data memory interface is embedded just like an ordinary floating point unit and handles both loads and stores. There are only a few exceptions to this rule. Figure 4.1 depicts the complete data memory function unit including the reservation stations.

Figure 4.1: The data memory reservation stations

The address operand of a memory instruction is always a GPR register and is transported in the low part of operand bus one. The immediate constant (provided by the decode/issue environment), which is used as the address offset, is added to the value on this bus before it is stored in reservation station R₀. If the operand is already valid during issue, the sum is the correct memory address. If it is not, the decode/issue environment puts zero on the operand bus, and the sum is just the immediate constant.

The second operand bus is only used by store instructions and provides the actual value to be stored. The operand busses op3 and op4 are not used by the data memory system.

The instructions and operands provided on these busses are stored in the data memory reservation stations. The data memory reservation stations are described in the next section. During dispatch, one instruction is passed from a reservation station to the single-adjust-one circuit (figure 4.2). This circuit is identical to the single-adjust-one circuit presented in chapter 3 except that it only modifies operand two. Operand one is always integer. After leaving the single adjust circuit, the instruction is passed to the data memory interface, which contains the actual interface to the data memory or data memory cache. After the memory access, the data memory interface passes the result of the instruction to the single-adjust-two circuit, which is identical to the single-adjust-two circuit in chapter 3. The single-adjust-two circuit passes the result to the producer circuit, which propagates it on the CDB.

Figure 4.2: Single adjust for the data memory reservation stations

4.2 The Data Memory Reservation Station

Each reservation station (figure 4.3) can hold one load/store instruction and its operands. The reservation station has a register for the full bit, the tag bits and an operation code op. The full bit indicates that the reservation station is in use. The tag data item is the ROB tag of the instruction in the reservation station. The op data item has the following components:

Figure 4.3: A single data memory reservation station

The op.load data item is one if the instruction is a load, and it is zero otherwise.
The op.fp data item is active iff the instruction is a floating point load or store.
The op.db data item is active iff the instruction is a double precision operation.
The op.op2.high data item is the least significant bit of the address of the data source register on stores. The op.dest.high data item is the least significant bit of the address of the destination register on loads.
The op.IR[28:26] data items are bits 26 to 28 of the instruction word and are used to determine the width of the memory operand (byte, halfword, word, double).

The first operand in the data memory reservation station is the address operand. This operand is always 32 bits wide. Thus, one reservation station operand is sufficient to store this data. Since the immediate constant has to be added to the address register, this operand requires special circuits (figure 4.4). It is identical to the usual reservation station operand circuit presented in chapter 3 (figure 3.15) except for the additional adder. This adder calculates the sum of the data on the CDB and the data in the operand register, which is the immediate constant.
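The behavior of the address operand can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name AddressOperand, its field names, and the snoop method are hypothetical and do not appear in the design. The sketch shows the two additions described above: the one at issue time, and the one performed when the missing register value arrives on the CDB.

```python
MASK32 = 0xFFFFFFFF  # the address operand is 32 bits wide


class AddressOperand:
    """Hypothetical model of the address operand of one reservation station."""

    def __init__(self, bus_value, bus_valid, bus_tag, immediate):
        # The decode/issue environment drives zero on the operand bus
        # if the address register is not valid yet.
        base = bus_value if bus_valid else 0
        # Issue-time addition: either the final address or just the immediate.
        self.data = (base + immediate) & MASK32
        self.valid = bus_valid
        self.tag = bus_tag

    def snoop(self, cdb_tag, cdb_data):
        """Add the register value from the CDB to the stored immediate."""
        if not self.valid and cdb_tag == self.tag:
            self.data = (self.data + cdb_data) & MASK32
            self.valid = True


# Operand valid at issue: the sum is already the memory address.
ready = AddressOperand(0x1000, True, 0, 8)

# Operand not valid at issue: the register holds only the immediate
# until the producer with tag 5 writes its result to the CDB.
waiting = AddressOperand(0, False, 5, -4)
waiting.snoop(5, 0x2000)
```

In both cases the reservation station ends up holding the complete address, so no further adder is needed on the path to the data memory.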

Figure 4.4: The data memory reservation station address operand

The second operand is only used for store instructions. It holds the value to be stored. Since this can be a double precision floating point value, two reservation station operands are required, one for the low and one for the high part. The operand circuit used for operand two is identical to the operand circuit presented in chapter 3 (figure 3.15).

The operation of the reservation stations of the memory system is identical to the operation of the reservation stations presented in chapter 3. However, the dispatch protocol is modified to ensure data integrity.

4.3 Dispatch Protocol

The dispatch protocol used in this design is taken from [Mül97a]. Four conditions determine whether the store in entry i may be dispatched. The first condition ensures that all operands of the entry are valid. The second condition tests whether the address operands of all preceding instructions are valid. These instructions are in the reservation stations with indices higher than i. The third condition makes sure that the memory operands of the preceding instructions do not overlap with the memory operand of the instruction in the entry to be dispatched. This condition is tested by the overlap(i,j) macro, which is true iff the memory operands in RS_i and RS_j overlap.

Condition four is an extension of the dispatch protocol presented in [Mül97a] and is necessary to realize precise interrupts. Stores must not be executed before all previous instructions have terminated, because any previous instruction might cause an interrupt. This is realized by comparing the ROB.head pointer with the tag stored in the reservation station.

The conditions for load dispatch are different because loads do not require in-order execution. The test for overlapping memory operands is therefore omitted if both instructions are loads. Condition four is omitted, too, because loads do not modify the memory.
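The four conditions can be summarized as a single predicate. The following is a minimal sketch, not the circuit itself: the Entry record, its field names, and the function may_dispatch are hypothetical, the overlap test is passed in as a function, and it is an assumption of this sketch that only occupied (full) reservation stations are checked and that condition two is applied to loads as well.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical view of one data memory reservation station."""
    full: bool
    is_load: bool
    tag: int          # ROB tag of the instruction
    addr_valid: bool  # address operand valid
    data_valid: bool  # store data valid (ignored for loads)


def may_dispatch(rs, i, rob_head, overlap):
    """Return True if the instruction in rs[i] may be dispatched."""
    e = rs[i]
    # Condition 1: all operands of the entry are valid.
    if not e.full or not e.addr_valid or not (e.is_load or e.data_valid):
        return False
    # Preceding instructions sit in the stations with indices higher than i.
    for j in range(i + 1, len(rs)):
        older = rs[j]
        if not older.full:
            continue
        # Condition 2: the address of every preceding instruction is valid.
        if not older.addr_valid:
            return False
        # Condition 3: no overlap, except between two loads.
        if not (e.is_load and older.is_load) and overlap(i, j):
            return False
    # Condition 4: a store must be at the head of the ROB.
    return e.is_load or e.tag == rob_head
```

A store that overlaps with an older load is held back, whereas two overlapping loads may be dispatched in any order.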

4.4 Implementation of the Dispatch Protocol

The additional conditions for dispatching a reservation station are implemented in the reservation station itself. The reservation station generates the RS_i.valid signal iff all operands are valid and all dispatch conditions hold. The reservation station control of the data memory system is therefore identical to the reservation station control presented in chapter 3.

Condition (1) is tested by the AND-tree in the reservation station (figure 4.3). Conditions (2) and (3) are tested by the address check circuit (figure 4.5). This circuit takes the address operands of all previous reservation stations as input and generates a valid signal named Avalid.

Figure 4.5: The data memory reservation station address comparator operand for reservation station RS_i

The first step in calculating this signal is to define the overlap(i,j) macro (figure 4.6). In the given implementation, only two different memory operand widths are considered, namely 64 bits and 32 bits. Halfword and byte wide operands are handled as 32-bit operands. In order to determine this operand width, the macro DB(i,j) is used. It is true iff at least one of the operands in RS_i or RS_j is a double precision value. The test for overlapping operands is done as follows: In case of single precision values, address bits 2 to 31 are compared. If double precision values are involved, address bits 3 to 31 are compared.

DB(i,j) = RS_i.op.db ∨ RS_j.op.db

overlap(i,j)	=	(RS_i.op1.data[31:3] = RS_j.op1.data[31:3]) ∧
		((RS_i.op1.data[2] = RS_j.op1.data[2]) ∨ DB(i,j))
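The overlap test can be checked with a few concrete addresses. The sketch below follows the prose description: address bits 3 to 31 are always compared, and bit 2 is compared only if neither operand is a double precision value. The function names and the argument layout are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def db(db_i, db_j):
    """True iff at least one of the two operands is double precision."""
    return db_i or db_j


def overlap(addr_i, db_i, addr_j, db_j):
    """Conservative overlap test on 32-bit or 64-bit aligned operands."""
    a_i, a_j = addr_i & MASK32, addr_j & MASK32
    # Bits [31:3] select the aligned double word.
    same_dword = (a_i >> 3) == (a_j >> 3)
    # Bit 2 selects the word inside the double word.
    same_word = ((a_i >> 2) & 1) == ((a_j >> 2) & 1)
    # Bit 2 is ignored as soon as a double precision operand is involved.
    return same_dword and (same_word or db(db_i, db_j))
```

For example, two word accesses at 0x100 and 0x104 do not overlap, but a double precision access at 0x100 overlaps with a word access at 0x104. Byte accesses at 0x101 and 0x103 are reported as overlapping because they are treated as 32-bit operands.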

The overlap(i,j) macro compares a pair i,j of reservation stations. In order to calculate the Avalid signal, the second step is to apply the overlap macro to all preceding instructions. This step includes a test for loads, which do not require this condition.

RS_i:

Figure 4.6: The overlap(i,j) macro

Condition (4) is tested as follows: The ROBvalid signal is active iff condition (4) holds.

RS_i.ROBvalid = (ROB.head = RS_i.tag) ∨ RS_i.op.load

Further CPI optimization is possible by implementing load forwarding or write combining on stores. In order to save hardware cost at lower performance, it is possible to perform all memory instructions in program order. The exact performance quantification of both implementations is left for simulations.

4.5 Memory Interface

4.5.1 Control

The memory interface (figure 4.7) is a generic function unit with the usual interface to the reservation stations and the producer as used in chapter 3. It has an additional pipeline stage to save cycle time. The pipeline registers of this stage are placed before the input signals of the data memory. M denotes this register. The output signals of the data memory are almost directly connected to the registers of the producer, thus, there are no critical paths through the data memory.

Figure 4.7: Memory interface

The memory is accessed by a 64-bit wide data path. In order to store single bytes, halfwords and words, the memory interface uses eight bank write signals mw[7:0]. These bank write signals are calculated by the Mwgen circuit, which is taken from [Lei98]. The first step is to determine the exact width of the operand. For this purpose, the bits B (byte), H (halfword), W (word), and D (doubleword) are calculated from bits of the instruction word.

The bits B[7:0] are derived from the address bits op1.data[2:0] by a decoder. They specify the offset of the memory operand in an aligned double word. The bank write signals mw[7:0] and the misalignment signal misa are computed as follows:

The bank write signals mw[7:0] and the D, W, and H signals are stored in the pipeline register M. The mw[7:0] are fed into the data memory and the D, W, and H signals are used in the alignment circuit for loads.
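The generation of the bank write signals can be illustrated as follows. This is a sketch under assumptions and not the Mwgen circuit from [Lei98]: it assumes that bank k holds the byte at offset k of the aligned double word, that an access is misaligned iff its offset is not a multiple of its width, and that no bank is written on loads or on misaligned accesses. The function name mwgen and its parameters are hypothetical.

```python
def mwgen(offset, width, is_store=True):
    """Return (mw, misa) for an access of `width` bytes at byte `offset`.

    offset: address bits [2:0], i.e. 0..7
    width:  1 (B), 2 (H), 4 (W) or 8 (D) bytes
    mw:     8-bit mask, bit k set means bank k is written
    """
    if width not in (1, 2, 4, 8):
        raise ValueError("width must be 1, 2, 4 or 8 bytes")
    if not 0 <= offset <= 7:
        raise ValueError("offset must be in 0..7")
    # Assumption: aligned means the offset is a multiple of the width.
    misa = offset % width != 0
    if misa or not is_store:
        return 0, misa
    # `width` consecutive banks starting at bank `offset`.
    mw = (((1 << width) - 1) << offset) & 0xFF
    return mw, misa
```

Under these assumptions, a word store at offset 4 activates the upper four banks, and a word store at offset 2 activates no bank and raises misa instead.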

As mentioned above, the data memory interface is built like any other function unit. This implies that the data memory interface has to generate and respect the flow control signals, which are RSvalid, FUvalid, Pstall, and FUstall. The FUstall signal is active iff the data memory interface is not able to accept further instructions. This is the case if the data memory itself is busy (DMEM.busy=1) or if the producer stalls the function unit (Pstall=1).

The FUvalid signal indicates that the function unit provides a valid result. This is true iff there is an instruction in the register (M.full=1) and the function unit is not stalled (FUstall=0).

The clock enable signal of the pipeline register of the function unit (Mce) is active if the function unit is not stalled (FUstall=0) or if there is an interrupt (JISR=1). In case of an interrupt, the register is cleared.
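The three control signals described above are simple Boolean functions of the inputs. The following sketch restates them in executable form. The function name flow_control and the lower-case parameter names are hypothetical; the equations themselves are taken from the text.

```python
def flow_control(dmem_busy, pstall, m_full, jisr):
    """Return (FUstall, FUvalid, Mce) for the data memory interface."""
    # The unit cannot accept instructions while the memory is busy
    # or while the producer stalls it.
    fu_stall = dmem_busy or pstall
    # A result is valid if register M holds an instruction and
    # the unit is not stalled.
    fu_valid = m_full and not fu_stall
    # Register M is clocked when not stalled, or on an interrupt
    # (in which case it is cleared).
    m_ce = (not fu_stall) or jisr
    return fu_stall, fu_valid, m_ce
```

Note that an interrupt enables the clock of M even during a stall, which allows the register to be cleared.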

4.5.2 Memory Exceptions

The memory system can generate two types of exceptions: page faults and misalignment exceptions. Page faults are used to implement virtual memory. The data memory indicates page faults by raising the pff signal. The data memory system propagates this event on the CDB by enabling the Dpf bit of the CDB. The current memory address in M.op1.data is passed in the EData component of the CDB.

Misaligned memory accesses are indicated by the CDB.dmal signal, which is the misa signal stored in the pipeline register.

4.5.3 Alignment Shifts

The align-for-store and align-for-load boxes take care of correct alignment before a store and after a load, respectively. These circuits are specialized shifters. Figure 4.8 depicts the valid alignments of the memory operands and the corresponding values of the lower address bits A[2:0].

Figure 4.8: Valid alignments

The align-for-load (Align4L) circuit (figure 4.9) performs the alignment shift after a load instruction. The first step is to select the bits of the memory operand from the 64-bit memory bus. This is done by three cascaded multiplexers, which are controlled by the address bits A[2:0]. The first multiplexer selects the correct 32-bit word from the 64-bit bus. The second multiplexer selects the correct 16-bit halfword from the 32-bit word generated by the first multiplexer. The third multiplexer selects the correct byte from this halfword.

The DLX instruction set supports two different types of integer load instructions. Loads of 8-bit or 16-bit memory operands can be performed with or without sign extension. Loads of 32-bit or 64-bit values are always done without sign extension. The align-for-load circuit uses a macro Sext_n,m(a,s), which is a conditional sign extension of an n-bit value a to m bits if the condition bit s is active. If s is not active, a is extended to m bits with leading zeros. The circuit is defined in appendix A.2. The condition bit is provided as bit IR[28] in the instruction word. The Align4L circuit returns zero in case of a store instruction in order to have defined values on the CDB.
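The multiplexer cascade and the conditional sign extension can be modeled in a few lines. This is a sketch under assumptions: the byte at offset k is assumed to occupy bits 8k+7 to 8k of the 64-bit bus (the actual placement is given by figure 4.8), bytes and halfwords are extended to 32 bits, and the sign extension condition is passed in as a plain flag instead of being decoded from IR[28]. The function names are hypothetical.

```python
def sext(a, n, m, s):
    """Sext_n,m(a,s): extend the n-bit value a to m bits.

    If s is set, the sign bit a[n-1] is replicated, otherwise
    the value is padded with leading zeros.
    """
    a &= (1 << n) - 1
    if s and (a >> (n - 1)) & 1:
        a |= ((1 << m) - 1) & ~((1 << n) - 1)
    return a


def align_for_load(bus, addr_low, width, signed, is_load=True):
    """Select and extend the memory operand from the 64-bit bus."""
    if not is_load:
        return 0  # defined value on the CDB for stores
    if width == 8:
        return bus & 0xFFFFFFFFFFFFFFFF
    # First multiplexer: A[2] selects the 32-bit word.
    word = (bus >> 32) & 0xFFFFFFFF if addr_low & 4 else bus & 0xFFFFFFFF
    if width == 4:
        return word
    # Second multiplexer: A[1] selects the halfword of that word.
    half = (word >> 16) & 0xFFFF if addr_low & 2 else word & 0xFFFF
    if width == 2:
        return sext(half, 16, 32, signed)
    # Third multiplexer: A[0] selects the byte of that halfword.
    byte = (half >> 8) & 0xFF if addr_low & 1 else half & 0xFF
    return sext(byte, 8, 32, signed)
```

With the bus value 0x8877665544332211, a signed byte load from offset 7 selects 0x88 and extends it to 0xFFFFFF88, and an unsigned halfword load from offset 2 yields 0x4433.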

Figure 4.9: Align for load

The align-for-store (Align4S) circuit (figure 4.10) is much simpler. The operand provided by the single-adjust-one circuit is copied to all valid locations on the 64-bit memory bus. Three multiplexers select the operand with the correct width.
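The replication performed before a store can be sketched as follows. The sketch models only the behavior, not the three multiplexers, and assumes that the operand arrives right-aligned from the single-adjust-one circuit. The function name align_for_store is hypothetical.

```python
def align_for_store(operand, width):
    """Copy an operand of `width` bytes to every valid bus position."""
    if width not in (1, 2, 4, 8):
        raise ValueError("width must be 1, 2, 4 or 8 bytes")
    bits = 8 * width
    value = operand & ((1 << bits) - 1)
    bus = 0
    # A byte is replicated eight times, a halfword four times,
    # a word twice, and a double word is passed through unchanged.
    for position in range(0, 64, bits):
        bus |= value << position
    return bus
```

Because the operand is present at every valid position, the bank write signals mw[7:0] alone decide which bytes of the memory are actually updated, so no address-dependent shift is needed on the store path.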

Figure 4.10: Align for store