Modular divide-and-conquer parallelization of nested loops

We propose a methodology for the automatic generation of divide-and-conquer parallel implementations of sequential nested loops. We focus on a class of loops that traverse read-only multidimensional collections (lists or arrays) and compute a function over them. Our approach is modular in that the inner loop nest is abstracted away to produce a simpler loop nest, and this summarized version is then parallelized. The main challenge addressed by this paper is that, to perform the code transformations required at each step, the loop nest may have to be automatically augmented with extra computation to make the abstraction and/or parallelization possible. We present theoretical results that justify the correctness of our modular approach, along with algorithmic solutions for automating it. Experimental results demonstrate that our approach parallelizes highly non-trivial loop nests efficiently.
