Accelerating image convolution filtering algorithms on integrated CPU–GPU architectures

Abstract. Convolution filtering is one of the most important algorithms in image processing. It is data-intensive, especially for high-definition images. Most previous studies on parallel acceleration of convolution focus on graphics processing units (GPUs), with the central processing unit (CPU) serving only as the host that manages data buffers and control flow. However, recent CPU architectures have gained significant data-parallel computing capabilities, and the integration of the CPU and GPU on a single chip is on the rise. We propose an approach to accelerate convolution filtering on integrated CPU–GPU heterogeneous architectures. We exploit the parallel processing power of CPU vector instructions and have the CPU work collaboratively with the on-chip GPU. Two task assignment methods, static and dynamic task partitioning, are proposed for CPU–GPU collaboration. We evaluate our approach with images and filters of different sizes. The experimental results show that our approach achieves up to 146 GFLOP/s using a quad-core CPU and runs 2.5 to 4.8 times faster than the single-GPU version of the OpenCV library. We also obtain a 90-fold speedup over the single-threaded CPU version. These results demonstrate the efficiency of the proposed approach.
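The idea of static task partitioning can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `gpu_share` ratio, and the use of two plain Python threads as stand-ins for the GPU and the CPU vector units are all assumptions made for illustration. The sketch splits the image rows once, up front, filters each part independently, and concatenates the results.

```python
from concurrent.futures import ThreadPoolExecutor


def convolve_rows(image, kernel, row_start, row_end):
    """Filter rows [row_start, row_end) with zero padding.

    Uses the correlation form (kernel not flipped), which is the usual
    convention in image filtering libraries.
    """
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_y, pad_x = k_h // 2, k_w // 2
    out = []
    for y in range(row_start, row_end):
        row = []
        for x in range(width):
            acc = 0.0
            for ky in range(k_h):
                for kx in range(k_w):
                    sy, sx = y + ky - pad_y, x + kx - pad_x
                    # Pixels outside the image count as zero.
                    if 0 <= sy < height and 0 <= sx < width:
                        acc += image[sy][sx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out


def static_partition(height, gpu_share):
    """Split rows once into a (hypothetical) GPU part and a CPU part."""
    split = max(0, min(height, round(height * gpu_share)))
    return (0, split), (split, height)


def filter_collaboratively(image, kernel, gpu_share=0.75):
    """Run both parts concurrently; threads stand in for the two devices."""
    gpu_rows, cpu_rows = static_partition(len(image), gpu_share)
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_part = pool.submit(convolve_rows, image, kernel, *gpu_rows)
        cpu_part = pool.submit(convolve_rows, image, kernel, *cpu_rows)
        return gpu_part.result() + cpu_part.result()
```

Because every output row reads only the input image, the two parts share no writes and the partitioned result matches a single-pass filter. A dynamic variant would, under the same assumptions, have each worker repeatedly take the next chunk of rows from a shared queue instead of fixing the split in advance.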
