Synthesis of divide and conquer parallelism for loops

Divide-and-conquer is a common parallel programming skeleton supported by many cross-platform multithreaded libraries, and it is the skeleton programmers most commonly use for parallelization. Producing a correct divide-and-conquer parallel program from a given sequential code, whether manually or automatically, poses two challenges: (1) assuming that a good solution exists in which individual worker threads execute code identical to the sequential one, the programmer has to provide the extra code for dividing the tasks and combining the partial results (i.e., joins), and (2) the sequential code may not be suitable for divide-and-conquer parallelization as is, and may need to be modified to become part of a good solution. We address both challenges in this paper. We present an automated technique for synthesizing correct joins and an algorithm for modifying the sequential code, when necessary, to make it suitable for parallelization. This paper focuses on the class of loops that traverse a read-only collection and compute a scalar function over that collection. We present theoretical results characterizing when the necessary modifications to the sequential code are possible, theoretical guarantees for the algorithmic solutions presented here, and an experimental evaluation of the approach's success in practice and of the quality of the produced parallel programs.
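To make the two challenges concrete, here is a minimal sketch in Python. The language, the function names (`mts_loop`, `lifted_loop`, `join`, `divide_and_conquer`), and the chunking strategy are illustrative assumptions. This is not the paper's synthesized output. The loop computes the maximum terminal (suffix) sum of a read-only list. Its result alone admits no join. For example, `[5] ++ [-1, 0]` and `[5] ++ [0]` produce the same pair of partial results (5, 0) but have whole-list answers 4 and 5. Adding an auxiliary accumulator for the running sum, which is a modification of the sequential code (challenge 2), makes a join possible (challenge 1).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# All names below are hypothetical, chosen for illustration only.

def mts_loop(a: List[int]) -> int:
    """Original sequential loop: maximum terminal (suffix) sum; the empty suffix counts as 0."""
    mts = 0
    for x in a:
        mts = max(mts + x, 0)
    return mts

def lifted_loop(a: List[int]) -> Tuple[int, int]:
    """The same loop, modified with an auxiliary accumulator `total` so a join exists."""
    total, mts = 0, 0
    for x in a:
        total += x
        mts = max(mts + x, 0)
    return total, mts

def join(left: Tuple[int, int], right: Tuple[int, int]) -> Tuple[int, int]:
    """Combine the partial results of two adjacent chunks (left precedes right)."""
    total_l, mts_l = left
    total_r, mts_r = right
    # A best suffix of left ++ right either lies entirely inside `right`,
    # or is a suffix of `left` extended by all of `right`.
    return total_l + total_r, max(mts_r, mts_l + total_r)

def divide_and_conquer(a: List[int], workers: int = 4) -> int:
    """Split the read-only list into chunks, run the lifted loop on each, then join in order."""
    if not a:
        return 0
    size = -(-len(a) // workers)  # ceiling division
    chunks = [a[i:i + size] for i in range(0, len(a), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lifted_loop, chunks))
    result = partials[0]
    for p in partials[1:]:
        result = join(result, p)
    return result[1]  # discard the auxiliary sum; keep the original loop's output
```

For example, `divide_and_conquer([3, -5, 2, 4, -1, 6, -8, 1])` returns 4, the same value that `mts_loop` computes on the whole list. Python threads appear here only to show the split/compute/join structure. Under CPython's global interpreter lock they give no speedup for this CPU-bound loop.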
