Commutative Set: a language extension for implicit parallel programming

Sequential programming models express a total program order, of which only a partial order must actually be respected. This inhibits parallelizing tools from extracting scalable performance. Programmer-written semantic commutativity assertions provide a natural way of relaxing this partial order, thereby exposing parallelism implicitly in a program. Existing implicit parallel programming models based on semantic commutativity either require additional programming extensions or have limited expressiveness. This paper presents a generalized programming extension based on semantic commutativity, called Commutative Set (COMMSET), and associated compiler technology that enables multiple forms of parallelism. COMMSET expressions are syntactically succinct and enable the programmer to specify commutativity relations between groups of arbitrary structured code blocks. Using only this construct, serializing constraints that inhibit parallelization can be relaxed, independent of any particular parallelization strategy or concurrency control mechanism. COMMSET enables well-performing parallelizations in cases where they were previously inapplicable or non-performing. By extending eight sequential programs with an average of only eight annotations per program, COMMSET and the associated compiler technology produced a geomean speedup of 5.7x on eight cores, compared to 1.5x for the best non-COMMSET parallelization.
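To make the underlying idea concrete, the sketch below shows a classic case of semantic commutativity: two linked-list insertions produce the same abstract set of keys in either order, yet they do not commute at the memory level (the final pointer structure depends on call order), so a purely dependence-based parallelizer must serialize them. The `@commutes` marker here is a hypothetical placeholder for exposition only; it is not the concrete COMMSET syntax defined in the paper.

```c
#include <stdlib.h>

struct node { int key; struct node *next; };
static struct node *head = NULL;

/* Two calls to insert() commute semantically: the resulting set of
 * keys is independent of call order, even though the concrete list
 * layout is not. A commutativity assertion lets the compiler run
 * such calls concurrently instead of serializing them. */
void insert(int key)            /* @commutes(insert) -- hypothetical annotation */
{
    struct node *n = malloc(sizeof *n);
    if (!n) return;
    n->key = key;
    n->next = head;
    head = n;
}

int main(void)
{
    /* With insert() asserted commutative with itself, a parallelizing
     * compiler may distribute these iterations across threads, adding
     * its own concurrency control around the list update. */
    for (int i = 0; i < 1000; i++)
        insert(rand());
    return 0;
}
```

Under such an assertion, the choice of parallelization strategy (e.g., DOALL or pipelined execution) and of concurrency control mechanism is left to the compiler rather than fixed by the annotation.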
