SMARTS: exploiting temporal locality and parallelism through vertical execution

In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors, with multiple levels of memory hierarchy, has forced the scientific programmer to choose between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.
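To make the idea of vertical execution concrete, the minimal C++ sketch below (illustrative only; the arrays, block size, and statement sequence are hypothetical and this is not POOMA's or SMARTS's actual API) contrasts horizontal, statement-at-a-time evaluation with block-wise vertical evaluation of two dependent data-parallel statements, so that each block of the intermediate array is consumed while it is still cache-resident.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t N = 1 << 20;      // total elements
    constexpr std::size_t BLOCK = 1 << 12;  // per-block element count, chosen to fit in cache

    std::vector<double> a(N, 1.0), b(N), c(N);

    // Horizontal execution would evaluate "b = a + 1" over all N elements,
    // then "c = b * 2" over all N elements; each block of b is evicted from
    // cache before the second statement reuses it.
    //
    // Vertical execution interleaves the two statements block by block,
    // reusing each block of b while it is still cache-resident:
    for (std::size_t start = 0; start < N; start += BLOCK) {
        const std::size_t end = std::min(start + BLOCK, N);
        for (std::size_t i = start; i < end; ++i) b[i] = a[i] + 1.0;  // statement 1
        for (std::size_t i = start; i < end; ++i) c[i] = b[i] * 2.0;  // statement 2
    }

    std::cout << c[0] << '\n';  // prints 4
    return 0;
}

Under the macro-dataflow model described in the abstract, each per-block statement instance can be viewed as a task: the block of statement 2 depends on the corresponding block of statement 1, so dependent statements on the same block run back-to-back for temporal locality, while independent blocks are free to run in parallel across processors.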
