API compilation for image hardware accelerators

We present an API-based compilation strategy that optimizes image applications, written with a high-level image processing library, for three different image processing hardware accelerators. We demonstrate that this strategy both lowers development cost and improves overall performance, in particular because it exploits optimization opportunities across library calls that are otherwise out of reach. The library API provides the semantics of the image computations. The three accelerator targets are quite distinct: the first uses a vector architecture; the second is a SIMD architecture; the last runs on both GPGPUs and multicores through OpenCL. We have adapted standard compilation techniques to perform the compilation and code generation tasks automatically. Our strategy is implemented in PIPS, a source-to-source compiler, which greatly reduces the development cost because standard phases are reused and parameterized. We carried out experiments with applications on hardware functional simulators and GPUs. Our contributions include: (1) a general low-cost compilation strategy for image processing applications, based on the semantics provided by library calls, which improves locality by an order of magnitude; (2) specific heuristics to minimize execution time on the target accelerators; (3) numerous experiments that show the effectiveness of our strategies. We also discuss the conditions required to extend this approach to other application domains.
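
To make the idea of optimizing across library calls concrete, the following is a minimal Python sketch, not the implementation in PIPS and not the API of the image library used in this work. The operator names (`threshold`, `add_const`, `invert`) and the two runner functions are hypothetical and chosen only for illustration. The sketch assumes that every call is a pointwise operator, so a sequence of calls can be merged into a single image traversal; neighborhood operators would need more care.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]
PixelOp = Callable[[int], int]

# Hypothetical pointwise "library calls"; the names are illustrative only.
def threshold(t: int) -> PixelOp:
    return lambda p: 255 if p >= t else 0

def add_const(c: int) -> PixelOp:
    return lambda p: min(255, p + c)  # saturating add on 8-bit pixels

def invert() -> PixelOp:
    return lambda p: 255 - p

def run_unfused(img: Image, ops: List[PixelOp]) -> Tuple[Image, int]:
    """Execute call by call: one traversal and one intermediate image per call."""
    passes = 0
    for op in ops:
        img = [[op(p) for p in row] for row in img]
        passes += 1
    return img, passes

def run_fused(img: Image, ops: List[PixelOp]) -> Tuple[Image, int]:
    """Use the known semantics of the calls to merge them into one traversal."""
    def fused(p: int) -> int:
        for op in ops:
            p = op(p)
        return p
    return [[fused(p) for p in row] for row in img], 1

image = [[10, 120, 200], [250, 90, 30]]
pipeline = [add_const(20), threshold(128), invert()]

unfused_result, unfused_passes = run_unfused(image, pipeline)
fused_result, fused_passes = run_fused(image, pipeline)
```

Both runners compute the same image, but the fused version touches each pixel once instead of once per call and allocates no intermediate images, which is the kind of locality gain a compiler can only obtain when it knows what each library call means.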
