Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

As many-core accelerators integrate ever more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Effective heterogeneous streaming requires carefully partitioning the hardware among tasks and matching the granularity of task parallelism to the resulting resource partition. Finding the right resource partition and task granularity is extremely challenging, however, because the space of possible solutions is large and the optimal solution varies across programs and datasets. This article presents an automatic approach that quickly derives a good resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the performance of the target application under a given resource partition and task granularity configuration; the model serves as a utility for quickly searching for a good configuration at runtime. Instead of hand-crafting an analytical model, which would require expert insight into low-level hardware details, we use machine learning to construct the model automatically: a predictive model is first trained offline on a set of training programs, and the learned model can then predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93 percent of the performance delivered by a theoretically perfect predictor.
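The core idea of the abstract can be sketched as a two-phase workflow: train a regression model offline on (program features, configuration) pairs labeled with measured performance, then at runtime score candidate configurations with the model and pick the best one. The sketch below is a minimal illustration of that workflow, not the paper's actual implementation; the feature layout, the synthetic training data, and the candidate grids are all assumptions made purely to keep the example self-contained and runnable.

```python
# Minimal sketch of model-guided configuration search (illustrative only).
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Offline phase: each training row is a program-feature vector concatenated
# with a (resource partition, task granularity) configuration; the label is
# the measured speedup over the single-stream baseline. The data here is
# synthetic, standing in for measurements collected on training programs.
X_train = rng.random((200, 5))   # 3 program features + partition + granularity
y_train = rng.random(200)        # measured speedup over single-stream version
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def best_config(prog_features, partitions, granularities):
    """Runtime phase: score every candidate (partition, granularity) pair
    with the learned model and return the highest-predicted-speedup one."""
    candidates = list(product(partitions, granularities))
    X = np.array([list(prog_features) + [p, g] for p, g in candidates])
    preds = model.predict(X)
    return candidates[int(np.argmax(preds))]

# Hypothetical unseen program: its feature vector is extracted at runtime,
# then the cheap model query replaces an exhaustive empirical search.
cfg = best_config([0.2, 0.5, 0.8],
                  partitions=[0.25, 0.5, 0.75],
                  granularities=[0.1, 0.5, 0.9])
```

The design point this illustrates is that predicting performance is much cheaper than measuring it, so the learned model can evaluate the whole candidate space in milliseconds where empirical auto-tuning would need one run per configuration.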
