Auto-tuning for large-scale image processing by dynamic analysis method on multicore platforms

This paper describes a general-purpose method for improving the performance of in-memory data processing, particularly large-scale image processing, on different multicore platforms. Tiling is widely used to achieve high performance when processing large arrays. However, frequent access to the memory system by multiple threads inevitably creates a system bottleneck. Our optimisation strategies are automatic thread scheduling and data/task partitioning. Both exploit spatial and temporal locality and can reduce memory traffic considerably. A scheduler automatically partitions the images into tiled blocks whose size is pre-determined according to the hardware configuration. It then fuses all operations on the same block to reduce the cache miss rate. The resulting parallel task execution is more effective than traditional parallel libraries such as OpenMP. Moreover, ordering the tiles along space-filling curves, which improves the locality of neighbouring tiled blocks, also contributes to faster memory access.
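The combination of tiling, per-block operation fusion, and space-filling-curve ordering can be illustrated with a minimal sketch. It is not the paper's scheduler: all names (`morton_key`, `tile_origins`, `run_fused`, `run_unfused`) are hypothetical, the tile size is passed in by hand rather than derived from the hardware configuration, a Z-order (Morton) curve is assumed as the space-filling curve, only pointwise operations are handled (stencil operations would need halo regions), and the code is single-threaded. Pure Python will not show the cache effect; the sketch only shows the traversal order and that fusing operations per tile gives the same result as running each operation over the whole image.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]
PixelOp = Callable[[int], int]


def morton_key(tx: int, ty: int, bits: int = 16) -> int:
    """Interleave the bits of tile coordinates (tx, ty) into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((tx >> i) & 1) << (2 * i)
        key |= ((ty >> i) & 1) << (2 * i + 1)
    return key


def tile_origins(height: int, width: int, tile: int) -> List[Tuple[int, int]]:
    """Return the (row, col) origin of every tile, sorted in Z-order."""
    coords = [(ty, tx)
              for ty in range((height + tile - 1) // tile)
              for tx in range((width + tile - 1) // tile)]
    coords.sort(key=lambda c: morton_key(c[1], c[0]))
    return [(ty * tile, tx * tile) for ty, tx in coords]


def run_fused(image: Image, ops: Sequence[PixelOp], tile: int) -> Image:
    """Apply all operations to one tile before moving on to the next tile."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r0, c0 in tile_origins(height, width, tile):
        rows = range(r0, min(r0 + tile, height))
        cols = range(c0, min(c0 + tile, width))
        for op in ops:  # every operation runs while this tile is still hot
            for r in rows:
                for c in cols:
                    out[r][c] = op(out[r][c])
    return out


def run_unfused(image: Image, ops: Sequence[PixelOp]) -> Image:
    """Baseline: each operation sweeps the whole image before the next starts."""
    out = [row[:] for row in image]
    for op in ops:
        out = [[op(v) for v in row] for row in out]
    return out
```

In a compiled implementation, each tile would be sized to fit in cache and handed to a worker thread, so the fused operations reuse the tile's data instead of re-reading the image from main memory once per operation.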
