Modular Synthesis of Divide-and-Conquer Parallelism for Nested Loops (Extended Version)

We propose a methodology for automatically generating divide-and-conquer parallel implementations of sequential nested loops. We focus on a class of loops that traverse read-only multidimensional collections (lists or arrays) and compute a function over them. Our approach is modular: the inner loop nest is first abstracted away (summarized) to produce a simpler loop nest, and this summarized version is then parallelized. The main challenge addressed by this paper is that, for the code transformations required at each step to be possible, the loop nest may first have to be augmented automatically with extra computation that enables the abstraction, the parallelization, or both. We present theoretical results that justify the correctness of our modular approach, together with algorithmic solutions for automating it. Experimental results demonstrate that our approach can parallelize highly non-trivial loop nests efficiently.
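
As an illustration only, the following Python sketch hand-applies the three ingredients named above (inner-loop summarization, augmentation with extra computation, and a divide-and-conquer join) to one small nested loop. The chosen problem (the maximum sum of a top strip of rows) and all function names are assumptions made for this example; the code is written by hand and does not reproduce the paper's synthesis algorithm or its output.

```python
def mtss_sequential(a):
    """Sequential nested loop: max over i of the sum of rows 0..i.

    The empty strip counts as 0, so the result is never negative.
    """
    best, total = 0, 0
    for row in a:
        for x in row:          # inner loop: accumulates the row into total
            total += x
        best = max(best, total)
    return best


def row_summary(row):
    """Step 1 (summarization): the inner loop abstracted as one function."""
    return sum(row)


def leaf(row):
    """State for a single row: (total sum, max top-strip sum).

    Step 2 (augmentation): the total sum is auxiliary. The original loop
    only returns `best`, but the join below cannot be written without it.
    """
    s = row_summary(row)
    return (s, max(0, s))


def join(left, right):
    """Step 3 (parallelization): combine results of two adjacent row blocks."""
    sum_l, mtss_l = left
    sum_r, mtss_r = right
    # The best strip either ends inside the left block, or covers all of
    # the left block and extends into the right block.
    return (sum_l + sum_r, max(mtss_l, sum_l + mtss_r))


def mtss_divide_and_conquer(a):
    """Recursive split over rows; the two halves are independent."""
    if not a:
        return (0, 0)
    if len(a) == 1:
        return leaf(a[0])
    mid = len(a) // 2
    return join(mtss_divide_and_conquer(a[:mid]),
                mtss_divide_and_conquer(a[mid:]))


if __name__ == "__main__":
    grid = [[1, -2], [3, 4], [-10, 1], [5, 5]]
    print(mtss_sequential(grid), mtss_divide_and_conquer(grid)[1])
```

The recursion here runs sequentially; because the two recursive calls share no state, they could in principle be handed to separate workers, which is the property a divide-and-conquer implementation relies on.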
