Semantic Language Extensions for Implicit Parallel Programming

Abstract: Several emerging fields of science and engineering are increasingly characterized by computationally intensive programs. Without parallelization, such programs do not benefit from the increasing core counts available in today's chip multiprocessors. However, writing correct and well-performing parallel programs is widely perceived to be an extremely hard problem. To understand the challenges scientific programmers face in effectively leveraging parallel computation, this dissertation first presents an in-depth field study of the practice of computational science. Based on the results of the field study, this dissertation proposes two new implicit parallel programming (IPP) solutions. IPP uses semantic programming extensions to overcome the artificial constraints that sequential models impose on automatic parallelization. These extensions preserve the ease of sequential programming and enable multiple forms of parallelism without additional parallelism constructs, achieving the best of both automatic and explicit parallelization. The first IPP solution, Commutative Set, generalizes existing notions of semantic commutativity. It allows a programmer to relax, with a high degree of expressiveness, the execution orders that a sequential programming model prohibits. The second IPP solution, WeakC, provides language extensions to relax the strict consistency requirements of sequential data structures, and dynamically optimizes a parallel configuration of these data structures via a combined compiler-runtime system. This dissertation evaluates both Commutative Set and WeakC on real-world applications running on real hardware, including some that scientists actively use in their day-to-day research. The detailed experimental evaluation results demonstrate the effectiveness of the proposed techniques.
