Jade: Compiler-Supported Multi-Paradigm Processor Virtualization-Based Parallel Programming

Current parallel programming approaches, typically based on message passing or shared-memory threads, require the programmer to write considerable low-level work-management and distribution code: partitioning and distributing data, balancing load, packing and unpacking data into messages, and so on. One remedy for this low-level style of programming is processor virtualization: the programmer assumes a large number of available virtual processors and creates a correspondingly large number of work objects, while an adaptive runtime system (ARTS) intelligently maps work to physical processors and performs dynamic load balancing to optimize performance. Charm++ and AMPI are implementations of this approach. Although Charm++ and AMPI enable the use of an ARTS, the program specification remains low-level and requires many details. Furthermore, the only mechanisms for information exchange are asynchronous method invocation and message passing, even though some applications are more naturally expressed in a shared-memory paradigm.

We explore the thesis that compiler support and optimizations, together with a disciplined shared-memory abstraction, can substantially improve programmer productivity while retaining most of the performance benefits of processor virtualization and the ARTS. The ideas proposed in this thesis are embodied in a new programming language, Jade, based on Java, Charm++, and AMPI. The language design adopts the Java memory model, automating memory management and eliminating void pointers and pointer arithmetic; by further automating various routine Charm++ tasks, it substantially improves programmer productivity. Jade introduces Multiphase Shared Arrays (MSA), which can be shared in read-only, write-many, and accumulate modes (sketched below). These simple modes scale well and are general enough to capture the majority of shared-memory access patterns.

We present novel uses of known compilation techniques, as well as new compile-time analyses suggested by the needs of the ARTS. One optimization strip-mines MSA loops and eliminates the per-access test that checks whether a page is present in the local MSA cache, bringing single-CPU MSA performance up to that of an equivalent sequential program. Another optimization generates guarded pack/unpack code that packs only live data, significantly reducing the time taken and the disk space needed to checkpoint (or migrate objects within) large applications; both optimizations are sketched below. The Jade language and compiler system described in this thesis can serve as a framework for further research into compiler-based, multi-paradigm, ARTS-supported parallel programming built upon processor virtualization.
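To make the MSA access modes concrete, here is a minimal Java sketch of the phase discipline the abstract describes. The Mode enum, the MultiphaseArray class, and the syncToMode() phase boundary are illustrative stand-ins, not the actual MSA or Jade API; in the real system the phase change is a collective synchronization across all workers.

```java
// A minimal sketch of the MSA mode discipline, assuming a hypothetical
// Java-level API. Mode, MultiphaseArray, and syncToMode() are illustrative
// stand-ins, not the actual MSA/Jade interface.
enum Mode { READ_ONLY, WRITE_MANY, ACCUMULATE }

final class MultiphaseArray {
    private final double[] data;
    private Mode mode;

    MultiphaseArray(int size, Mode initialMode) {
        data = new double[size];
        mode = initialMode;
    }

    // Phase boundary: in the real runtime this is a collective sync
    // across all workers; here it simply switches the legal access mode.
    void syncToMode(Mode next) { mode = next; }

    double get(int i) {
        if (mode != Mode.READ_ONLY)
            throw new IllegalStateException("get() requires read-only mode");
        return data[i];
    }

    void set(int i, double v) {
        if (mode != Mode.WRITE_MANY)
            throw new IllegalStateException("set() requires write-many mode");
        data[i] = v;  // each element written by at most one worker per phase
    }

    void accumulate(int i, double v) {
        if (mode != Mode.ACCUMULATE)
            throw new IllegalStateException("accumulate() requires accumulate mode");
        data[i] += v; // contributions combine via a commutative operation
    }
}

public class MsaModesDemo {
    public static void main(String[] args) {
        MultiphaseArray a = new MultiphaseArray(4, Mode.WRITE_MANY);
        a.set(0, 1.0);                        // write-many phase
        a.set(1, 2.0);
        a.syncToMode(Mode.ACCUMULATE);
        a.accumulate(0, 0.5);                 // accumulate phase
        a.syncToMode(Mode.READ_ONLY);
        System.out.println(a.get(0) + " " + a.get(1)); // read-only phase
    }
}
```

Because each phase permits only one kind of access, the runtime can use simple coherence protocols for each mode, which is what lets these modes scale well.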
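The strip-mining optimization can be illustrated with a hypothetical page-cache interface. In a naive translation, every MSA element access tests whether the containing page is in the local cache; the strip-mined loop below hoists that test out of the inner loop, performing it once per page, so the hot path is a plain array traversal. This is how single-CPU MSA performance can match that of a sequential program. The PageCache interface and PAGE_SIZE constant are assumptions for illustration, not the real MSA cache API.

```java
// A sketch of the strip-mining optimization, assuming a hypothetical
// PageCache interface and page size; neither is the real MSA cache API.
final class StripMinedSum {
    static final int PAGE_SIZE = 1024;

    // Stand-in for the local MSA page cache: getPage() fetches the page
    // from a remote owner if it is absent, then returns local storage.
    interface PageCache {
        double[] getPage(int pageIndex);
    }

    // Naive code tests page presence on every element access:
    //   for (int i = 0; i < n; i++) total += msa.get(i); // check per element
    // The strip-mined loop hoists that check so it runs once per page.
    static double sum(PageCache cache, int n) {
        double total = 0.0;
        for (int base = 0; base < n; base += PAGE_SIZE) {
            double[] page = cache.getPage(base / PAGE_SIZE); // one check per page
            int limit = Math.min(PAGE_SIZE, n - base);
            for (int j = 0; j < limit; j++) {
                total += page[j]; // tight inner loop, no presence test
            }
        }
        return total;
    }
}
```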
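Finally, a hedged sketch of what guarded pack/unpack code might look like. The pack()/unpack() methods and the liveness flag below are illustrative, not the actual Charm++/Jade PUP interface; the idea is simply that a compiler-inferred guard lets a checkpoint skip large dead buffers, which is the source of the time and disk-space savings mentioned above.

```java
import java.io.*;

// A sketch of guarded pack/unpack, assuming hypothetical pack()/unpack()
// methods; this is not the actual Charm++/Jade PUP interface. A
// compiler-inferred liveness flag guards a large scratch buffer so that
// dead data is never written to the checkpoint.
final class Solver {
    double[] scratch;     // large temporary; often dead at checkpoint time
    boolean scratchLive;  // guard: would be inferred by the Jade compiler

    void pack(DataOutputStream out) throws IOException {
        out.writeBoolean(scratchLive);
        if (scratchLive) {                 // pack only live data
            out.writeInt(scratch.length);
            for (double d : scratch) out.writeDouble(d);
        }
    }

    void unpack(DataInputStream in) throws IOException {
        scratchLive = in.readBoolean();
        if (scratchLive) {
            scratch = new double[in.readInt()];
            for (int i = 0; i < scratch.length; i++) scratch[i] = in.readDouble();
        } else {
            scratch = null;                // reallocated lazily on next use
        }
    }
}
```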
