High-Bandwidth Address Generation Unit

In this paper we describe efficient data-fetch circuitry for retrieving several operands from an n-bank interleaved memory system in a single machine cycle. The proposed address generation (AGEN) unit operates with a modified version of the low-order-interleaved memory access approach. Our design supports data structures of arbitrary length and with different (odd) strides. We present a detailed discussion of the 32-bit AGEN design, which is aimed at multiple-operand functional units. The experimental results indicate that, when implemented on a Virtex-II Pro xc2vp30-7ff1696 FPGA device, our AGEN can produce 8 × 32-bit addresses every 6 ns for different stride cases while using only trivial hardware resources.
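The restriction to odd strides can be illustrated with a minimal sketch of plain low-order interleaving. This is not the modified scheme proposed in the paper. It assumes word addressing and a power-of-two bank count (8, matching the eight addresses per cycle quoted above), and the function names `bank_and_offset`, `generate_addresses`, and `is_conflict_free` are hypothetical.

```python
def bank_and_offset(addr, n_banks):
    """Low-order interleaving: the low bits pick the bank, the rest the row.

    Assumption: addr is a word address and n_banks is a power of two.
    """
    return addr % n_banks, addr // n_banks


def generate_addresses(base, stride, n_banks):
    """Word addresses of n_banks consecutive elements of a strided structure."""
    return [base + i * stride for i in range(n_banks)]


def is_conflict_free(base, stride, n_banks):
    """True if the n_banks generated addresses all fall in distinct banks."""
    banks = {bank_and_offset(a, n_banks)[0]
             for a in generate_addresses(base, stride, n_banks)}
    return len(banks) == n_banks


if __name__ == "__main__":
    N_BANKS = 8  # assumed bank count for illustration
    for stride in (1, 3, 5, 7, 2, 4):
        status = "conflict-free" if is_conflict_free(0, stride, N_BANKS) else "bank conflicts"
        print(f"stride {stride}: {status}")
```

Under these assumptions, any odd stride is coprime with the power-of-two bank count, so the eight addresses land in eight different banks and can be served in the same cycle. An even stride maps several of them to the same bank.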
