Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions and stages with global or data-dependent access patterns. Because of this complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimizing both parallelism and locality, but the nature of stencils creates a fundamental tension between parallelism, locality, and the redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines and across different hardware architectures, including multicores with SIMD and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
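The tension between locality and redundant recomputation can be made concrete with a small sketch. The code below is plain C++, not Halide and not the paper's compiler output. It evaluates the same two-stage 3x3 box blur under two hand-written loop orders and counts how often the first stage runs. All names (`Image`, `blur_x`, `blur_breadth_first`, `blur_fused`, `compare_schedules`) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: a tiny row-major integer image type (not a Halide type).
struct Image {
    int w, h;
    std::vector<int> px;
    Image(int w_, int h_) : w(w_), h(h_), px(static_cast<std::size_t>(w_) * h_, 0) {}
    int& at(int x, int y) { return px[static_cast<std::size_t>(y) * w + x]; }
    int at(int x, int y) const { return px[static_cast<std::size_t>(y) * w + x]; }
};

// Counts evaluations of the first stage so the two loop orders can be compared.
static long blur_x_evals = 0;

// Stage 1: horizontal 3-tap sum at (x, y); the caller keeps 1 <= x <= w - 2.
int blur_x(const Image& in, int x, int y) {
    ++blur_x_evals;
    return in.at(x - 1, y) + in.at(x, y) + in.at(x + 1, y);
}

// Loop order A (breadth-first): compute and store all of stage 1, then run
// stage 2 over the stored values. No recomputation, but every intermediate
// value travels through a full-size buffer, so producer-consumer locality is poor.
Image blur_breadth_first(const Image& in) {
    Image tmp(in.w, in.h), out(in.w, in.h);
    for (int y = 0; y < in.h; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            tmp.at(x, y) = blur_x(in, x, y);
    for (int y = 1; y < in.h - 1; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            out.at(x, y) = tmp.at(x, y - 1) + tmp.at(x, y) + tmp.at(x, y + 1);
    return out;
}

// Loop order B (full fusion): evaluate stage 1 on demand inside stage 2.
// No intermediate buffer, but each stage-1 value shared by vertically
// adjacent outputs is recomputed, up to three times.
Image blur_fused(const Image& in) {
    Image out(in.w, in.h);
    for (int y = 1; y < in.h - 1; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            out.at(x, y) = blur_x(in, x, y - 1) + blur_x(in, x, y) + blur_x(in, x, y + 1);
    return out;
}

// Runs both loop orders on the same input, checks that the results agree,
// and reports how many stage-1 evaluations each one needed.
struct EvalCounts { long breadth_first; long fused; };

EvalCounts compare_schedules(const Image& in) {
    blur_x_evals = 0;
    Image a = blur_breadth_first(in);
    long n_a = blur_x_evals;
    blur_x_evals = 0;
    Image b = blur_fused(in);
    long n_b = blur_x_evals;
    assert(a.px == b.px);  // same algorithm, different evaluation order
    return {n_a, n_b};
}
```

Both functions compute identical output; only the evaluation order and the amount of stage-1 work differ. Tiled or sliding-window variants fall between these two extremes, which is the kind of choice a schedule is meant to express separately from the algorithm.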
