HetroCV: Auto-tuning Framework and Runtime for Image Processing and Computer Vision Applications on Heterogeneous Platform

With the wide adoption of high-performance processors and accelerators, large-scale computer vision applications have gained substantial performance improvements. However, achieving optimal performance from manually tuned programs often requires extensive experimentation and expertise, and the programs often need to be re-tuned when they are ported to a different platform or run under a different system configuration. To overcome this problem, in this paper we propose HetroCV, a programmer-directed auto-tuning framework and runtime for computer vision applications on heterogeneous CPU-MIC platforms. In the HetroCV auto-tuning framework, computation units in the application pipeline are categorized into one of three patterns: Map, Stencil, and MapReduce, and program statistics are extracted from the units' meta-information. Machine learning is used to train a model for each pattern from the tuned parameters and program statistics of trial run sets, so that when a new unit is presented, the HetroCV auto-tuner can use the corresponding trained model to generate optimized tuning parameters. In the HetroCV runtime, performance models for the processor and the co-processor are built to predict the execution time of each computation unit in the application pipeline. We adopt a maximum-throughput mapping strategy: each unit is mapped dynamically to the processor or co-processor queue that yields the minimum overall execution time. Experiments on two medical image processing applications running on a heterogeneous platform composed of an Intel Xeon CPU and an Intel Xeon Phi co-processor showed better performance than naive OpenMP tuning and Genetic Algorithm (GA) based heuristic tuning.
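The mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple greedy rule (send each unit to the queue with the earliest predicted finish time), it ignores data dependencies and transfer costs between units, and the unit names, the `PREDICTED` table, and the `map_units` helper are all hypothetical stand-ins for the paper's performance models.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceQueue:
    name: str
    busy_until: float = 0.0  # predicted time at which the queue drains
    units: list = field(default_factory=list)


def map_units(units, predict):
    """Greedily assign each unit to the queue with the earliest predicted finish.

    `predict(unit, device)` is an assumed callback that returns the predicted
    execution time of `unit` on `device` ("cpu" or "mic").
    """
    queues = {"cpu": DeviceQueue("cpu"), "mic": DeviceQueue("mic")}
    for unit in units:
        # Predicted finish = current backlog of the queue + run time on that device.
        best = min(
            queues.values(),
            key=lambda q: q.busy_until + predict(unit, q.name),
        )
        best.busy_until += predict(unit, best.name)
        best.units.append(unit)
    return queues


# Hypothetical predicted execution times (arbitrary units), for illustration only.
PREDICTED = {
    "blur": {"cpu": 4.0, "mic": 2.0},
    "threshold": {"cpu": 1.0, "mic": 3.0},
    "histogram": {"cpu": 2.0, "mic": 2.5},
}

queues = map_units(
    ["blur", "threshold", "histogram"],
    lambda unit, device: PREDICTED[unit][device],
)
for q in queues.values():
    print(q.name, q.units, q.busy_until)
```

With these made-up numbers, `blur` goes to the co-processor queue, while `threshold` and `histogram` go to the CPU queue, because the co-processor's backlog makes it the later finisher for `histogram` even though the per-unit times are close. A real runtime would replace the lookup table with the trained performance models.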

Xin Qi | Manish Parashar | David J. Foran | Daihou Wang
