High-bandwidth Address Generation Unit

In this paper we present an efficient data fetch circuit that retrieves several operands from an n-way parallel memory system in a single machine cycle. The proposed address generation unit uses an improved version of the low-order parallel memory access approach. Our design supports data structures of arbitrary length and different odd strides. Experimental results show that, when implemented on a Virtex-II Pro xc2vp30-7ff1696 FPGA device, our address generation unit generates eight 32-bit addresses every 6 ns for different strides while using only trivial hardware resources.
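To illustrate why odd strides suit low-order parallel memory access, the following is a minimal sketch of plain low-order interleaving, not the improved scheme proposed in the paper (whose details are not given here). It assumes n = 8 memory modules, chosen to match the eight addresses generated per cycle. The function names `generate_addresses` and `is_conflict_free` are hypothetical. Under low-order interleaving, the module index is the address modulo n and the row inside the module is the address divided by n. When the stride is odd and n is a power of two, n consecutive elements fall into n distinct modules and can be fetched together.

```python
def generate_addresses(base, stride, n_modules=8):
    """Map n_modules consecutive strided elements to (module, row) pairs.

    Assumption: plain low-order interleaving, where the low-order part of
    the linear address selects the module and the rest selects the row.
    """
    pairs = []
    for i in range(n_modules):
        linear = base + i * stride           # linear address of element i
        module = linear % n_modules          # low-order part: module index
        row = linear // n_modules            # high-order part: row in module
        pairs.append((module, row))
    return pairs


def is_conflict_free(base, stride, n_modules=8):
    """True if all n_modules elements land in distinct modules."""
    modules = [m for m, _ in generate_addresses(base, stride, n_modules)]
    return len(set(modules)) == n_modules


if __name__ == "__main__":
    for stride in (1, 2, 3, 5, 7):
        print(stride, is_conflict_free(base=0, stride=stride))
```

With eight modules, every odd stride is coprime to 8, so each group of eight elements touches each module exactly once. An even stride such as 2 sends two elements to the same module and would need more than one cycle in this simple model.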
