Paper Information - High-Bandwidth Address Generation Unit

In this paper we describe efficient data fetch circuitry for retrieving several operands from an n-bank interleaved memory system in a single machine cycle. The proposed address generation (AGEN) unit operates with a modified version of the low-order-interleaved memory access approach. Our design supports data structures of arbitrary length and with different (odd) strides. A detailed discussion of the 32-bit AGEN design, aimed at multiple-operand functional units, is presented. The experimental results indicate that our AGEN is capable of producing 8 × 32-bit addresses every 6 ns for different stride cases when implemented on a Virtex-II Pro xc2vp30-7ff1696 FPGA device using trivial hardware resources.
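The restriction to odd strides can be illustrated with a small sketch of plain low-order interleaving. This is not the paper's modified scheme or its hardware design; it only shows the textbook mapping, assuming 8 banks and word-granular addresses. The function names are hypothetical and chosen for illustration.

```python
# Minimal sketch of textbook low-order interleaving (NOT the paper's modified scheme).
# Assumptions: 8 banks, word-granular addresses; all names are illustrative.

def bank_and_row(addr, n_banks=8):
    """Low-order interleaving: the low bits select the bank, the rest select the row."""
    return addr % n_banks, addr // n_banks

def generate_addresses(base, stride, n_banks=8):
    """One address per bank-wide fetch: base, base + stride, base + 2*stride, ..."""
    return [base + i * stride for i in range(n_banks)]

def is_conflict_free(base, stride, n_banks=8):
    """True when the generated addresses all fall into distinct banks."""
    banks = [bank_and_row(a, n_banks)[0]
             for a in generate_addresses(base, stride, n_banks)]
    return len(set(banks)) == n_banks

if __name__ == "__main__":
    for stride in (1, 2, 3, 5):
        addrs = generate_addresses(5, stride)
        banks = [bank_and_row(a)[0] for a in addrs]
        print(f"stride={stride}: banks={banks} conflict_free={is_conflict_free(5, stride)}")
```

With a power-of-two bank count, any odd stride is coprime with the number of banks, so eight consecutive strided accesses touch eight different banks and could in principle be served in one cycle. An even stride maps at least two of the accesses to the same bank.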

Stamatis Vassiliadis | Georgi Gaydadjiev | Carlo Galuzzi | Humberto Calderon
