Parallel associative reductions in Halide

Halide is a domain-specific language for fast image processing that separates pipelines into the algorithm, which defines what values are computed, and the schedule, which defines how they are computed. Changes to the schedule are guaranteed not to change the results. While Halide supports parallelizing and vectorizing naturally data-parallel operations, it does not support the same scheduling for reductions. Instead, the programmer must create data parallelism by manually factoring reductions into multiple stages. This manipulation of the algorithm can introduce bugs, impairs readability and portability, and makes it impossible for automatic scheduling methods to parallelize reductions. We describe a new Halide scheduling primitive rfactor, which moves this factoring transformation into the schedule, as well as a novel synthesis-based technique that takes serial Halide reductions and synthesizes an equivalent binary associative reduction operator and its identity. This enables us to automatically replace the original pipeline stage with a pair of stages which first compute partial results over slices of the reduction domain, and then combine them. Our technique permits parallelization and vectorization of Halide algorithms which previously required manipulating both the algorithm and the schedule.
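The factoring transformation the abstract describes can be sketched in plain Python rather than Halide; the helper below is illustrative, not Halide's API. Given an associative operator with a known identity, the reduction is split into a first stage that computes partial results over slices of the reduction domain (each slice independent, hence parallelizable or vectorizable) and a second stage that combines the partials:

```python
# Sketch of the two-stage factoring that rfactor performs, in plain
# Python. `factored_reduce` is a hypothetical name, not part of Halide.

def factored_reduce(op, identity, xs, num_slices):
    """Reduce xs with associative `op` by computing partial results
    over strided slices of the reduction domain, then combining them."""
    n = len(xs)
    # Stage 1: one partial result per slice. Each iteration of this
    # outer loop is independent, so it could run in parallel.
    partials = []
    for s in range(num_slices):
        acc = identity
        for i in range(s, n, num_slices):  # strided slice of the domain
            acc = op(acc, xs[i])
        partials.append(acc)
    # Stage 2: combine the partial results serially.
    result = identity
    for p in partials:
        result = op(result, p)
    return result

# Example: summation, an associative operator with identity 0.
data = list(range(1, 101))
print(factored_reduce(lambda a, b: a + b, 0, data, 8))  # -> 5050
```

Correctness of this rewrite is exactly what requires associativity and a true identity, which is why the synthesis step must establish both before the transformation is applied.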
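The synthesized operator and identity must actually satisfy the associativity and identity laws for the rewrite to be sound. The paper's technique establishes this with solver-based synthesis; as a simplified stand-in, the checks themselves can be stated as brute-force tests over a small finite domain:

```python
# Brute-force checks of the two algebraic laws the synthesized operator
# must satisfy. A finite domain is a simplification: the paper verifies
# these properties with an SMT solver over the full value domain.

def is_associative(op, domain):
    """op(op(a, b), c) == op(a, op(b, c)) for all a, b, c in domain."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a in domain for b in domain for c in domain)

def is_identity(op, e, domain):
    """e is a left and right identity of op over domain."""
    return all(op(e, a) == a and op(a, e) == a for a in domain)

domain = list(range(-4, 5))
print(is_associative(max, domain))   # max is associative
print(is_identity(max, -4, domain))  # -4 acts as identity on this domain
```

Note that `-4` is an identity for `max` only relative to this finite domain; over all integers the identity would be negative infinity, which is why verification over the full value domain matters.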
