Paper Information - Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture

Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, which reduces the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM interconnect; and (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-based architectures in accelerating the performance of three irregular computations: sparse conjugate gradient, a natural-join database operation, and an object-oriented database query.
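To make "irregular" concrete, the sketch below shows a sparse matrix-vector product, the core kernel of conjugate gradient, in plain Python. It is an illustration only and not code from the DIVA paper: the compressed sparse row (CSR) layout, the function name `csr_matvec`, and the small test matrix are all assumptions chosen for the example. The point to notice is the indirect access `x[col_idx[k]]`, whose address depends on data and therefore defeats regular, cache-friendly access patterns.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout).

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i's entries
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect, data-dependent load: the irregular access pattern.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_matvec(values, col_idx, row_ptr, x)
print(y)
```

On a conventional system, each of these scattered loads crosses the processor-memory interface; the abstract's argument is that performing such work inside memory avoids much of that traffic.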
