Active Memory Processor for Network-on-Chip-Based Architecture

Memory-intensive operations and their memory access latency are often the performance bottleneck in parallel applications. In this paper, we investigate the concept of an active memory operation, which is a data processing operation performed on the memory side. Using active memory operations, we can replace multiple memory access transactions over the on-chip network, and the related computations on the processor side, with a smaller number of high-level transactions and computations on the memory side. To realize the concept, we have designed a special-purpose processor, called the active memory processor, which is tightly coupled with the memory and executes the active memory operations. In our case studies, we applied the concept to five real-world applications (parallelized JPEG, FFT, text indexing for data mining, histogram, and eikonal equation solver) running on a 36-tile architecture with 64 cores and four memory tiles, and found that the proposed approach can improve performance by 20.5 to 259.3 percent.
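To make the transaction-saving idea concrete, the following is a minimal toy model rather than the paper's design. It counts on-chip network transactions for a histogram update performed two ways: on the processor side, where each update is one read and one write to the memory tile, and on the memory side, where one request carries a batch of indices and the increments happen next to the memory. The function names, the batch size of 16, and the one-transaction-per-request cost model are all assumptions made for illustration.

```python
def histogram_processor_side(data, bins):
    """Processor-side update: each element costs a read and a write transaction."""
    memory = [0] * bins
    transactions = 0
    for value in data:
        count = memory[value]       # read request to the memory tile
        transactions += 1
        memory[value] = count + 1   # write request to the memory tile
        transactions += 1
    return memory, transactions


def histogram_memory_side(data, bins, batch_size=16):
    """Memory-side update: one high-level request per batch of indices.

    batch_size is a hypothetical parameter, not a value from the paper.
    """
    memory = [0] * bins
    transactions = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        transactions += 1           # single request carrying the whole batch
        for value in batch:         # performed by the assumed memory-side processor
            memory[value] += 1
    return memory, transactions


data = [i % 8 for i in range(64)]
hist_a, tx_a = histogram_processor_side(data, 8)
hist_b, tx_b = histogram_memory_side(data, 8)
print(hist_a == hist_b, tx_a, tx_b)
```

Both variants produce the same histogram. Under this simplified cost model, only the number of requests crossing the network differs; it ignores request size, contention, and memory-side compute time.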
