Analyzing program flow within a many-kernel OpenCL application

Many developers have begun to realize that heterogeneous multi-core and many-core computer systems offer significant performance opportunities for a range of applications. A typical application contains multiple components that can be parallelized, so developers need proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) computer vision algorithm written in OpenCL. Our profiling framework is built on OpenCL API function calls alone, without the need for an external profiler. We show that this framework can begin to identify the bottlenecks and performance issues present in individual components on different hardware platforms. We demonstrate that run-time profiling based on the OpenCL specification gives an application developer a fine-grained look at performance, and that this information can be used to tailor performance improvements to specific platforms.
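As an illustration of the kind of per-kernel breakdown such profiling can produce, the following minimal sketch aggregates execution time per kernel from start and end timestamps. It assumes the timestamps are in nanoseconds, as returned by the OpenCL call clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a command queue created with CL_QUEUE_PROFILING_ENABLE. The function name, the kernel names, and the timestamp values are hypothetical and are not taken from the paper's framework or results.

```python
from collections import defaultdict


def summarize_kernel_times(records):
    """Aggregate per-kernel time from (kernel_name, start_ns, end_ns) records.

    Returns a list of (kernel_name, calls, total_ms, share_of_total) tuples,
    sorted by total time in descending order.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for name, start_ns, end_ns in records:
        if end_ns < start_ns:
            raise ValueError(f"end before start for kernel {name!r}")
        entry = totals[name]
        entry["calls"] += 1
        entry["total_ms"] += (end_ns - start_ns) / 1e6  # ns -> ms

    grand_total = sum(e["total_ms"] for e in totals.values())
    summary = []
    for name, e in totals.items():
        share = e["total_ms"] / grand_total if grand_total > 0 else 0.0
        summary.append((name, e["calls"], e["total_ms"], share))

    # Largest total first, so likely bottlenecks appear at the top.
    summary.sort(key=lambda row: row[2], reverse=True)
    return summary


# Hypothetical timestamps (ns); real values would come from OpenCL events.
records = [
    ("build_integral_image", 0, 2_000_000),
    ("hessian_det", 2_000_000, 8_000_000),
    ("hessian_det", 8_000_000, 14_000_000),
    ("non_max_suppression", 14_000_000, 16_000_000),
]

summary = summarize_kernel_times(records)
for name, calls, total_ms, share in summary:
    print(f"{name:24s} calls={calls} total={total_ms:.3f} ms share={share:.1%}")
```

Sorting by total time rather than by per-call time is one possible design choice: it surfaces kernels that are individually cheap but invoked many times, which matters in a many-kernel application.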