TIDBITS: speedup via time-delay bit-slicing in ALU design for VLSI technology

A novel word-wide ALU organization based on bit-level pipelining is proposed. This ALU pipeline is shown to be area efficient and easy to control. The performance of this ALU organization is analyzed in the context of two applications: integer summation/multiplication and multidimensional array address stream generation. Near optimal speedup with respect to the added area is shown to be achieved for these and other applications where integer additions predominate.
1. INTRODUCTION
Many computer applications call for the addition of a large set of integer numbers. Examples include checksum generation, array processor address stream generation, and integer multiplication. In each of these applications, the dominant component of the addition time is usually the carry propagation time of the arithmetic logic unit (ALU). The throughput of an ALU can be improved either by reducing the carry propagation time using carry lookahead circuitry or by pipelining the carry chain. The advantage of carry lookahead circuitry is that it reduces the addition delay under all circumstances. However, the complexity of carry lookahead circuitry is superlinear with respect to the ALU word size [1], and MOS implementation of carry lookahead adders requires substantial area in practice [2].
If the job load of an ALU is dominated by integer addition, a higher-performance, more cost-effective pipeline-based solution is available. This paper proposes a novel ALU design based on time delayed control signals between bit slices (TIDBITS) of the register file and ALU complex. The proposed ALU achieves O(n) speedup relative to an n-bit ripple-carry parallel ALU using only O(n) additional area. This speedup is achieved whenever the job load contains a sufficiently large number of integer additions, including the case when the additions are mutually dependent. Moreover, the simplicity and regularity of the pipeline control logic allows very compact MOS implementation. Thus the implied constant in the O(n) additional area is quite small.
This paper is organized as follows: Section two motivates the problem with a critique of conventional pipelined ALU design, followed by a description of the proposed TIDBITS ALU organization. In Section three the performance of the TIDBITS ALU is analyzed together with its data path implications within the context of two applications: integer summation/multiplication and array address stream generation. Section four contains the conclusions.
This research was supported by the Joint Services Electronics Program contract N00014-84-C-0149 and by the Semiconductor Research Corporation contract SRC RSCH 84-06-049.
2. DESIGN OF THE ALU
In a conventional ripple-carry ALU with a register file, the clock speed and hence the throughput of the ALU is determined by the time needed to read and write the register file plus the time required to perform the slowest operation. Since addition is almost always the slowest operation performed by an ALU, and since the carry propagation time is the limiting factor in a ripple-carry ALU, the clock speed is constrained by the time needed to propagate the carry through every bit position. The throughput of a ripple-carry ALU can be increased by pipelining the carry propagation signals. Pipelined ripple-carry ALUs are rarely used in practice because conventional pipelining techniques cannot deliver the speedup in a simple, area-efficient manner. The problems with conventional pipelining techniques are discussed below.
2.1. Critique of Conventional Pipelined ALUs
A ripple-carry ALU can be pipelined by inserting a carry latch (CL) between slices of one or more bits. An example of a 32-bit, four-stage ALU organized as four slices of eight bits each is shown in Figure 1.
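The conventional scheme just described can be illustrated with a minimal Python sketch. It is a behavioral model under stated assumptions, not the circuit of Figure 1 and not the TIDBITS organization: the carry latches are modeled as a list, the skewing of the higher-order inputs is modeled by letting slice i work on the operand pair that entered i clocks earlier, and the names `pipelined_add` and `slice_of` are hypothetical.

```python
WORD_BITS = 32
SLICES = 4
SLICE_BITS = WORD_BITS // SLICES        # eight bits per slice
SLICE_MASK = (1 << SLICE_BITS) - 1


def slice_of(word, i):
    """Return bit slice i (0 = least significant) of a word."""
    return (word >> (i * SLICE_BITS)) & SLICE_MASK


def pipelined_add(pairs):
    """Add a stream of (a, b) pairs on a four-stage carry-pipelined adder.

    One new pair enters slice 0 per clock. Slice i works on the pair that
    entered i clocks earlier (the input skew) and takes its carry-in from
    the carry latch written by slice i-1 on the previous clock.
    Returns a list of (clock, sum mod 2**32), one entry per pair.
    """
    n = len(pairs)
    carry_latch = [0] * SLICES          # carry_latch[i]: carry into slice i
    partial = {}                        # pair index -> sum slices so far
    results = []
    for clock in range(n + SLICES - 1):
        next_latch = [0] * SLICES
        for i in range(SLICES):
            j = clock - i               # index of the pair now in slice i
            if not 0 <= j < n:
                continue                # slice i is idle on this clock
            a, b = pairs[j]
            s = slice_of(a, i) + slice_of(b, i) + carry_latch[i]
            partial[j] = partial.get(j, 0) | ((s & SLICE_MASK) << (i * SLICE_BITS))
            if i + 1 < SLICES:
                next_latch[i + 1] = s >> SLICE_BITS   # latch carry-out
            else:
                results.append((clock, partial.pop(j)))  # all slices done
        carry_latch = next_latch        # latches update at the clock edge
    return results
```

In this model the carry ripples through only eight bit positions per clock, each sum is complete three clocks after its pair enters slice 0, and once the pipeline is full one sum completes per clock. For example, `pipelined_add([(0xFFFFFFFF, 1)])` carries across all three latches before the single result is produced.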
Appropriate data alignment latches are needed to skew the input to the higher-order bit slices (SL) in time and to deskew the output from the low-order output slices (DL). The ALU function control signals must also be delayed in ALU control latches (XL) between consecutive slices. The clock speed is faster for the pipelined ALU than for the corresponding nonpipelined ripple-carry ALU because in the pipelined ALU the clock speed is constrained by the
0149-7111/85/0000/0028$01.00 © 1985 IEEE