Parallel Pipeline on Heterogeneous Multi-processing Architectures

We address the problem of supporting the execution of a single streaming application, implemented as a pipeline of stages, on heterogeneous chips comprising several CPU cores and one on-chip GPU. In this paper, we present an API that allows the user to specify the type of parallelism exploited by each pipeline stage running on the multicore CPU, the mapping of the pipeline stages to the devices (GPU or CPU), and the number of active threads. Using a real streaming application as a case study, we evaluate how these parameters affect the performance and energy efficiency of a heterogeneous on-chip processor (Exynos 5 Octa) that has three different compute devices: a GPU, a quad-core A15, and a quad-core A7. We also explore some memory optimizations and find that, while their performance impact depends on the granularity type, they usually reduce energy consumption.
