HetroCV: Auto-tuning Framework and Runtime for Image Processing and Computer Vision Applications on Heterogeneous Platform

With the wide adoption of high-performance processors and accelerators, large-scale computer vision applications have gained substantial performance improvements. However, achieving optimal performance from manually tuned programs often requires extensive experimentation and expertise, and the programs often need to be re-tuned when ported to a different platform or run under a different system configuration. To overcome this problem, we propose HetroCV, a programmer-directed auto-tuning framework and runtime for computer vision applications on heterogeneous CPU-MIC (Many Integrated Core) platforms. In the HetroCV auto-tuning framework, each computation unit in the application pipeline is categorized into one of three patterns (Map, Stencil, or MapReduce), and program statistics are extracted from the units' meta-information. Machine learning is used to train a model for each pattern from the tuned parameters and program statistics of a set of trial runs, so that when a new unit is presented, the HetroCV auto-tuner can use the corresponding trained model to generate optimized tuning parameters. In the HetroCV runtime, performance models for the processor and the co-processor are built to predict the execution time of each computation unit in the application pipeline. We adopt a maximum-throughput mapping strategy: each unit is mapped dynamically to whichever queue, processor or co-processor, yields the minimum overall execution time. Experiments on two medical image processing applications, run on a heterogeneous platform composed of an Intel Xeon CPU and an Intel Xeon Phi co-processor, showed that HetroCV outperforms both naive OpenMP tuning and Genetic Algorithm (GA) based heuristic tuning.
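The mapping step described above can be illustrated with a minimal sketch. It assumes a simple greedy rule: each unit goes to the queue whose predicted finish time, after adding that unit, is smallest. The names `DeviceQueue`, `map_units`, `predict`, and the numbers in `PREDICTED` are hypothetical and chosen only for illustration; they are not taken from HetroCV, and the sketch ignores data transfer costs and dependencies between pipeline units.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceQueue:
    """A work queue for one device, tracking when it becomes free."""
    name: str
    busy_until: float = 0.0          # predicted time at which the queue drains
    units: list = field(default_factory=list)


def map_units(units, predict):
    """Greedily assign each unit to the queue that would finish it earliest.

    units   -- iterable of unit names, in pipeline order
    predict -- callable (unit, device_name) -> predicted execution time
               (stands in for the per-device performance models)
    """
    queues = [DeviceQueue("cpu"), DeviceQueue("mic")]
    for unit in units:
        # Finish time on each queue = current backlog + predicted run time.
        best = min(queues, key=lambda q: q.busy_until + predict(unit, q.name))
        best.busy_until += predict(unit, best.name)
        best.units.append(unit)
    return queues


# Hypothetical predicted execution times in seconds (illustrative only).
PREDICTED = {
    "blur":      {"cpu": 4.0, "mic": 2.0},
    "threshold": {"cpu": 1.0, "mic": 3.0},
    "histogram": {"cpu": 3.0, "mic": 2.5},
}


def predict(unit, device):
    return PREDICTED[unit][device]


if __name__ == "__main__":
    for q in map_units(["blur", "threshold", "histogram"], predict):
        print(q.name, q.units, q.busy_until)
```

With these made-up numbers, `histogram` is sent to the CPU queue even though it runs faster in isolation on the co-processor, because the co-processor queue is already occupied by `blur`. This is the sense in which the mapping targets overall execution time rather than per-unit time.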
