Improving workload balance and code optimization in processor-in-memory systems

Processor-in-memory (PIM) architectures have been proposed in recent years. One major objective of PIM is to reduce the performance gap between the CPU and memory. To exploit the potential benefits of PIM, we designed SAGE, a statement-based parallelizing system. In this paper, we extend this system to achieve better performance by devising several comprehensive optimization techniques: IMOP (Intelligent Memory Operation) recognition, tiling for PIM, and a precise mechanism for obtaining a load-balanced execution schedule. Experimental results are also presented and discussed.
