Mapping Streaming Applications on Commodity Multi-CPU and GPU On-Chip Processors

In this paper, we consider the problem of efficiently executing streaming applications on commodity processors composed of several cores and an on-chip GPU. Streaming applications, such as those in vision and video analytics, consist of a pipeline of stages and are good candidates to take advantage of this type of platform. We also consider that the characteristics of the input may change while the application is running. Therefore, we propose a framework that adaptively finds the optimal mapping of the pipeline stages. The core of the framework is an analytical model that, coupled with information collected at runtime, is used to dynamically map each pipeline stage to the most efficient device, taking into consideration both performance and energy. Our experimental results show that, for the evaluated applications running on two different architectures, our model always predicts the best configuration among the evaluated alternatives and significantly reduces the amount of information that needs to be collected at runtime. This best configuration has, on average, 20 percent higher throughput than the configuration recommended by a baseline state-of-the-art approach, while the throughput/energy ratio is 43 percent higher. We have measured improvements in throughput and throughput/energy of up to 81 and 204 percent, respectively, when the model is used to adapt to a video that changes from low to high definition.
