Portable library support for irregular applications

Building portable parallel programs on distributed memory multiprocessors and workstation networks is a complex task that is greatly facilitated by powerful infrastructure. In this dissertation, we develop important components of that infrastructure, focusing on irregular applications such as unstructured mesh computations, search problems, and discrete event simulation. We use a library-based approach to building such applications. The library provides a uniform programming interface on multiple platforms and has highly tuned implementations developed by the library programmer. Therefore, applications built on the library can be portable both in functionality and in performance.

We describe the major components of our parallel data structure library called Multipol, including two of the more irregular data structures and one application. The two data structures are a task stealer for dynamic load balancing and an event graph for discrete event simulation. The application is a timing-level circuit simulator for combinational circuits. We analyze the workloads of several applications built by the Multipol group and quantitatively characterize their irregularities.

The Multipol library is built on a runtime layer consisting of threads as well as communication mechanisms. The thread layer supports a basic computational abstraction called fibers, which are code sequences that appear to execute atomically. The fiber abstraction enables a portable multithreading execution environment for latency hiding. The thread layer also allows the programmer to supply customized schedulers to enforce application-specific scheduling policies. The communication layer provides portable primitives for expressing irregular communication. It uses a technique called message aggregation to trade the excess parallelism in the application for better communication bandwidth.

We provide a new performance profiling toolkit called Mprof to help tune the performance of irregular parallel programs. Mprof identifies two major sources of performance inefficiency: overhead and insufficient parallelism. It uses statistical modeling to extract reusable cost models from benchmark executions. The cost models are combined with high-level statistics collected from an actual execution to provide low-overhead profiling information. Mprof also provides a performance interface for the library programmer to customize the profiling information and thereby preserve the library abstraction. Using information from Mprof, we optimize the performance of several irregular applications and demonstrate the performance portability of the Multipol library and runtime layer.
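
The cost-model idea can be illustrated with a minimal sketch. It is not Mprof's actual interface or model: the function names, the linear model form (a fixed cost per operation plus a cost per byte), and the benchmark numbers are all assumptions made for illustration. The sketch fits a model to benchmark timings once, then combines it with two cheap counters from a real run (message count and total bytes) to estimate communication overhead without timing each message.

```python
# Illustrative only: names, model form, and numbers are assumptions,
# not taken from Mprof.

def fit_linear_cost(samples):
    """Least-squares fit of cost = fixed + per_unit * size.

    samples: list of (size, measured_cost) pairs from benchmark runs.
    Returns (fixed, per_unit).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct sizes to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    per_unit = sxy / sxx
    fixed = mean_y - per_unit * mean_x
    return fixed, per_unit


def estimate_overhead(model, op_count, total_size):
    """Combine a fitted model with high-level counters from a real run."""
    fixed, per_unit = model
    return fixed * op_count + per_unit * total_size


# Hypothetical benchmark: (message size in bytes, send cost in microseconds).
benchmark = [(64, 13.2), (256, 22.8), (1024, 61.2), (4096, 214.8)]
send_model = fit_linear_cost(benchmark)

# Hypothetical counters collected cheaply during an application run.
estimated_us = estimate_overhead(send_model, op_count=1000, total_size=200_000)
```

Because the model is fitted once per platform and reused, the running application only has to maintain counters, which is what keeps the profiling overhead low in this sketch.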
