Exploiting ILP in page-based intelligent memory

This study compares the speed, area, and power of different implementations of Active Pages, an intelligent memory system that helps bridge the growing gap between processor and memory performance by associating simple functions with each page of data. Previous investigations have shown speedups of up to 1000X when a block of reconfigurable logic next to each subarray on a DRAM chip implements these functions. We show that instruction-level parallelism, not hardware specialization, is the key to the previous success with reconfigurable logic. To demonstrate this, we developed an Active Page implementation based on a simplified VLIW processor. Unlike conventional VLIW processors, this design has a small number of pipeline stages because of power and area constraints. Our results demonstrate that a four-wide VLIW processor attains performance comparable to that of pure FPGA logic but requires significantly less area and power.
