Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture

Processing-in-memory (PIM) chips, which integrate processor logic into memory devices, offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, which reduces the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to support such computations effectively at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM interconnect; and (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-based architectures to accelerate three irregular computations: sparse conjugate gradient, a natural-join database operation, and an object-oriented database query.
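The first mechanism, computing in memory so that less data crosses the processor-memory interface, can be illustrated with a toy traffic count. This is only a sketch under stated assumptions, not DIVA's actual parcel format or memory model: the `Parcel` fields, the `PimNode` class, the `"sum"` operation, and the 8-byte word size are all hypothetical, and only words returned to the host are counted (request traffic is ignored).

```python
from dataclasses import dataclass

WORD_BYTES = 8  # assumed word size, used only for counting traffic


@dataclass
class Parcel:
    """Hypothetical parcel: a target memory node, an operation, and arguments."""
    target: int
    op: str
    args: tuple = ()


class PimNode:
    """Toy memory node holding one slice of data that it can reduce locally."""

    def __init__(self, values):
        self.values = list(values)

    def handle(self, parcel):
        # Only one illustrative operation is modeled here.
        if parcel.op == "sum":
            return sum(self.values)
        raise ValueError(f"unsupported op: {parcel.op}")


def host_side_sum(nodes):
    """Host pulls every word across the interface and reduces it itself."""
    total, words = 0, 0
    for node in nodes:
        for value in node.values:
            total += value
            words += 1
    return total, words * WORD_BYTES


def in_memory_sum(nodes):
    """Host sends one parcel per node; each node returns a single partial sum."""
    total, words = 0, 0
    for index, node in enumerate(nodes):
        total += node.handle(Parcel(target=index, op="sum"))
        words += 1  # one result word per node comes back to the host
    return total, words * WORD_BYTES


if __name__ == "__main__":
    nodes = [PimNode(range(1000)) for _ in range(4)]
    print(host_side_sum(nodes))
    print(in_memory_sum(nodes))
```

Both functions compute the same total; in this toy model the in-memory version returns one word per node instead of one word per element, which is the kind of traffic reduction the abstract describes.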
