Capsules: Expressing Composable Computations in a Parallel Programming Model

A well-known problem in designing high-level parallel programming models and languages is the "granularity problem": executing parallel task instances that are too fine-grain incurs large overheads in the parallel run-time and decreases the speed-up achieved by parallel execution, whereas tasks that are too coarse-grain create load imbalance and do not adequately utilize the parallel machine. In this work we attempt to address this issue with "Capsules", a parallel programming model built around the concept of expressing "composable computations". Such composability allows execution granularity to be adjusted at run-time. In Capsules, we provide a unifying framework that allows composition and adjustment of granularity for both data and computation, over iteration space and computation space. We show that this concept allows the user to express decisions not only about the granularity of execution but also about the granularity of garbage collection and of other features that the programming model may support. We argue that this adaptability of execution granularity leads to efficient parallel execution by matching the available application concurrency to the available hardware concurrency, thereby reducing parallelization overhead. By matching, we mean creating coarse-grain Computation Capsules that each encompass multiple fine-grain computation instances. In effect, creating coarse-grain computations reduces overhead simply by reducing the number of parallel computations. This leads to: (1) reduced synchronization cost, such as for blocked searches in shared data structures; (2) reduced distribution and scheduling cost for parallel computation instances; and (3) reduced book-keeping cost for maintaining data structures, such as those for unfulfilled data requests. Capsules builds on our prior work, TStreams, a data-flow oriented parallel programming framework.
Our results on an SMP machine, using the Cascade Face Detector and Stereo Vision Depth applications, show that adjusting execution granularity through profiling helps determine the optimal coarse-grain serial execution granularity, reduces parallelization overhead, and yields maximum application performance.
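
To make the idea of matching granularity concrete, the following is a minimal sketch of grouping fine-grain computation instances into coarser units that each run serially. It is not the Capsules API: the names `fine_grain_work`, `make_capsules`, `run_capsule`, `run_parallel`, and the `granularity` parameter are hypothetical and chosen only for illustration, and a standard Python thread pool stands in for the parallel run-time. The sketch shows only how the number of scheduled tasks shrinks as granularity grows; it does not measure speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def fine_grain_work(i):
    """Hypothetical stand-in for one fine-grain computation instance."""
    return i * i


def make_capsules(n_items, granularity):
    """Split the iteration space [0, n_items) into ranges of at most
    `granularity` items; each range plays the role of one coarse-grain unit."""
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return [range(start, min(start + granularity, n_items))
            for start in range(0, n_items, granularity)]


def run_capsule(indices):
    """Run all fine-grain instances of one coarse-grain unit serially."""
    return [fine_grain_work(i) for i in indices]


def run_parallel(n_items, granularity, workers=4):
    """Schedule one task per coarse-grain unit and return the flattened
    results together with the number of tasks that were scheduled."""
    capsules = make_capsules(n_items, granularity)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(run_capsule, capsules))  # preserves input order
    results = [value for chunk in chunks for value in chunk]
    return results, len(capsules)


if __name__ == "__main__":
    for g in (1, 4, 16):
        results, n_tasks = run_parallel(64, g)
        print(f"granularity={g}: {n_tasks} tasks scheduled")
```

With `granularity=1` every fine-grain instance becomes its own task, while larger values schedule fewer, longer tasks. In this sketch the value is set by hand; the abstract describes choosing it through profiling.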
