SMARTS: exploiting temporal locality and parallelism through vertical execution

In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors, with multiple levels of memory hierarchy, has forced the scientific programmer to choose between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.
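To make the idea of vertical execution concrete, the minimal C++ sketch below (illustrative only; the arrays, block size, and statement sequence are hypothetical and this is not POOMA's or SMARTS's actual API) contrasts horizontal, statement-at-a-time evaluation with block-wise vertical evaluation of two dependent data-parallel statements, so that each block of the intermediate array is consumed while it is still cache-resident.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t N = 1 << 20;      // total elements
    constexpr std::size_t BLOCK = 1 << 12;  // per-block element count, chosen to fit in cache

    std::vector<double> a(N, 1.0), b(N), c(N);

    // Horizontal execution would evaluate "b = a + 1" over all N elements,
    // then "c = b * 2" over all N elements; each block of b is evicted from
    // cache before the second statement reuses it.
    //
    // Vertical execution interleaves the two statements block by block,
    // reusing each block of b while it is still cache-resident:
    for (std::size_t start = 0; start < N; start += BLOCK) {
        const std::size_t end = std::min(start + BLOCK, N);
        for (std::size_t i = start; i < end; ++i) b[i] = a[i] + 1.0;  // statement 1
        for (std::size_t i = start; i < end; ++i) c[i] = b[i] * 2.0;  // statement 2
    }

    std::cout << c[0] << '\n';  // prints 4
    return 0;
}

Under the macro-dataflow model described in the abstract, each per-block statement instance can be viewed as a task: the block of statement 2 depends on the corresponding block of statement 1, so dependent statements on the same block run back-to-back for temporal locality, while independent blocks are free to run in parallel across processors.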
