Critical Block Scheduling: A Thread-Level Parallelizing Mechanism for a Heterogeneous Chip Multiprocessor Architecture

Processor-in-Memory (PIM) architectures integrate processing units and memory blocks on a single chip to narrow the performance gap between processor and memory in high-performance computing. A PIM architecture combines heterogeneous processors in a single system, and these processors differ in their computation and memory-access capabilities. A mechanism is therefore needed to identify those capabilities and dispatch appropriate tasks to each heterogeneous processing element. This paper presents a novel parallelizing mechanism, called Critical Block Scheduling, that aims to fully utilize all of the heterogeneous processors in a PIM architecture. Integrated with our thread-level parallelizing system, Octans, the mechanism decomposes the original program into blocks, produces the corresponding dependence graph, creates a feasible execution schedule, and generates the corresponding threads for the host and memory processors. Critical Block Scheduling can not only parallelize programs for PIM architectures but can also be applied to other Multi-Processor System-on-Chip (MPSoC) and Chip Multiprocessor (CMP) architectures that consist of multiple heterogeneous processors. Experimental results on real benchmarks are also discussed.
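The pipeline described in the abstract (blocks, dependence graph, schedule, per-processor assignment) can be illustrated with a generic critical-path list scheduler. The sketch below is not the paper's Critical Block Scheduling algorithm and is not part of Octans; it is a minimal illustration under stated assumptions. The names `Block`, `critical_weights`, and `schedule`, the two processor labels `"host"` and `"mem"`, and all cost figures are hypothetical, and data-transfer costs between processors are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A program block with a hypothetical execution cost on each processor."""
    name: str
    cost: dict                      # e.g. {"host": 2, "mem": 6}, made-up units
    deps: list = field(default_factory=list)  # names of blocks this one waits on


def critical_weights(blocks):
    """Longest path from each block to a sink, using each block's best-case cost."""
    by_name = {b.name: b for b in blocks}
    succs = {b.name: [] for b in blocks}
    for b in blocks:
        for d in b.deps:
            succs[d].append(b.name)
    memo = {}

    def weight(n):
        if n not in memo:
            best = min(by_name[n].cost.values())
            memo[n] = best + max((weight(s) for s in succs[n]), default=0)
        return memo[n]

    return {n: weight(n) for n in by_name}


def schedule(blocks, processors=("host", "mem")):
    """Greedy list scheduling: most critical ready block first,
    placed on whichever processor finishes it earliest."""
    by_name = {b.name: b for b in blocks}
    weights = critical_weights(blocks)
    free_at = {p: 0 for p in processors}   # time each processor becomes idle
    finish = {}                            # block name -> finish time
    plan = []
    remaining = set(by_name)
    while remaining:
        ready = [n for n in remaining
                 if all(d in finish for d in by_name[n].deps)]
        if not ready:
            raise ValueError("dependence graph contains a cycle")
        # Highest critical weight wins; ties broken by name for determinism.
        n = min(ready, key=lambda r: (-weights[r], r))
        b = by_name[n]
        data_ready = max((finish[d] for d in b.deps), default=0)

        def end_on(p):
            return max(free_at[p], data_ready) + b.cost[p]

        p = min(processors, key=lambda q: (end_on(q), q))
        end = end_on(p)
        plan.append((n, p, end - b.cost[p], end))
        finish[n] = end
        free_at[p] = end
        remaining.remove(n)
    return plan


# Hypothetical four-block program: B is memory-bound, the rest favor the host.
blocks = [
    Block("A", {"host": 2, "mem": 6}),
    Block("B", {"host": 8, "mem": 3}, deps=["A"]),
    Block("C", {"host": 3, "mem": 7}, deps=["A"]),
    Block("D", {"host": 2, "mem": 5}, deps=["B", "C"]),
]
plan = schedule(blocks)
for name, proc, start, end in plan:
    print(f"{name}: {proc} [{start}, {end})")
```

In this toy input, the memory-bound block B is placed on the memory processor while C runs concurrently on the host, so the two independent blocks overlap. A real system would derive the per-processor costs from program analysis or profiling and would also need to account for communication between the host and memory processors.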
