Memory Latency Effects in Decoupled Architectures

Decoupled computer architectures partition the memory access and execute functions of a program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures use an access processor to fetch data ahead of demand by the execute process and are therefore often less sensitive to memory access delays than conventional architectures. Past performance studies of decoupled computers used interleaved or pipelined memory systems, and in those studies latency effects were partially hidden by the interleaving. This paper presents a detailed simulation study of latency effects in decoupled computers. Decoupled architecture performance is compared with that of single processors with caches, and the memory latency sensitivity of cache-based uniprocessors and decoupled systems is studied. Simulations are performed to determine the significance of data caches in a decoupled architecture. It is observed that decoupled architectures can reduce the peak memory bandwidth requirement but not the total bandwidth, whereas data caches can reduce the total bandwidth by capturing locality. The conclusion is that, despite their ability to partially mask the effects of memory latency, decoupled architectures still need a data cache.
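The latency-hiding idea can be illustrated with a toy cycle count. The sketch below is not the simulator used in the paper; it is a minimal model under stated assumptions: a stream of n loads, each followed by a fixed amount of computation, a bounded load queue between the access and execute units, and a hypothetical `mem_busy` parameter giving how many cycles the memory is occupied per request (1 approximates a pipelined or interleaved memory, a value equal to the latency approximates a single non-pipelined module). All function and parameter names are illustrative.

```python
def blocking_cycles(n, latency, compute):
    """Conventional model: every load stalls the processor until data returns."""
    return n * (latency + compute)


def decoupled_cycles(n, latency, compute, queue_depth, mem_busy=1):
    """Toy decoupled model: an access unit runs ahead and fills a load queue.

    Assumptions (not taken from the paper): loads are independent, the
    execute unit consumes them in order, and the memory accepts a new
    request every mem_busy cycles.
    """
    issue = [0] * n    # cycle at which load i is sent to memory
    finish = [0] * n   # cycle at which the execute unit finishes item i
    for i in range(n):
        t = 0
        if i > 0:
            # The memory is occupied for mem_busy cycles per request.
            t = issue[i - 1] + mem_busy
        if i >= queue_depth:
            # A queue slot frees when the execute unit starts item i - queue_depth.
            t = max(t, finish[i - queue_depth] - compute)
        issue[i] = t
        arrive = t + latency
        # The execute unit waits for both the data and the previous item.
        start = max(arrive, finish[i - 1] if i > 0 else 0)
        finish[i] = start + compute
    return finish[-1]


if __name__ == "__main__":
    n, latency, compute, depth = 100, 10, 2, 8
    print("blocking:              ", blocking_cycles(n, latency, compute))
    print("decoupled, pipelined:  ", decoupled_cycles(n, latency, compute, depth))
    print("decoupled, one module: ",
          decoupled_cycles(n, latency, compute, depth, mem_busy=latency))
```

With these example parameters the blocking model takes 1200 cycles, the decoupled model with a pipelined memory takes 210, and the decoupled model with a single non-pipelined module takes 1002. In this toy model, then, most of the latency tolerance disappears once the memory can serve only one request at a time, which is consistent with the abstract's remark that earlier studies benefited from interleaving. The model has no cache and does not address the bandwidth observations.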
