Mozart: Efficient Composition of Library Functions for Heterogeneous Execution

The current trend in processor design is to couple a commodity processor with a GPU, a co-processor, or an accelerator. Unleashing the full computational power of such heterogeneous systems is a daunting task: programmers often resort to heterogeneous scheduling runtime frameworks that use device-specific library routines. However, highly tuned libraries do not compose well across heterogeneous architectures; important performance-oriented optimizations such as data locality and reuse “across” library calls are not fully exploited. In this paper, we present Mozart, a framework that extends existing library frameworks to efficiently compose a sequence of library calls for heterogeneous execution. Mozart consists of two components: a library description (LD) and a library composition runtime. Library writers wrap existing libraries using LD to expose their performance parameters on heterogeneous cores; no programmer intervention is necessary. Our runtime composes libraries via task-fission, load balances among heterogeneous cores using information from LD, and automatically adapts to the runtime behavior of an application. We evaluate Mozart on a Xeon + 2 Xeon Phi system using High Performance Linpack, the most popular benchmark for ranking supercomputers in the TOP500, and show a GFLOPS improvement of 31.7% over MKL with Automatic Offload and 6.7% over hand-optimized ninja code.
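
To make the two components concrete, the sketch below shows what an LD wrapper and a toy composition runtime could look like in C++. This is a minimal illustration under assumed names, not Mozart's actual API: LibraryDescriptor, Runtime, and the per-device throughput fields are hypothetical, and a real runtime would execute the fissioned sub-tasks concurrently on different devices rather than sequentially.

```cpp
// Hypothetical sketch of a library-description (LD) wrapper and a toy
// composition runtime. All names here are illustrative assumptions.
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Performance parameters a library writer might expose per device:
// estimated throughputs (elements/sec) that the runtime can use for
// load balancing across heterogeneous cores.
struct LibraryDescriptor {
    const char* name;
    double cpu_throughput;   // e.g. measured offline on the host cores
    double acc_throughput;   // e.g. measured offline on the accelerator
    // The kernel operates on a sub-range [begin, end) of the data,
    // which is what makes task-fission possible.
    std::function<void(std::vector<double>&, std::size_t, std::size_t)> kernel;
};

// A toy runtime performing "task fission": one library call is split into
// a host part and an accelerator part, sized proportionally to the
// throughputs declared in the descriptor.
struct Runtime {
    void run(const LibraryDescriptor& ld, std::vector<double>& data) {
        double total = ld.cpu_throughput + ld.acc_throughput;
        std::size_t split =
            static_cast<std::size_t>(data.size() * ld.cpu_throughput / total);
        // In a real system these two halves would run concurrently on
        // different devices; here they run back to back for illustration.
        ld.kernel(data, 0, split);            // host share
        ld.kernel(data, split, data.size());  // accelerator share
        std::cout << ld.name << ": host got " << split << " of "
                  << data.size() << " elements\n";
    }
};

int main() {
    // A trivial "library function" (scale by 2) wrapped in an LD.
    LibraryDescriptor scale{
        "dscal",
        /*cpu_throughput=*/1.0, /*acc_throughput=*/3.0,
        [](std::vector<double>& v, std::size_t b, std::size_t e) {
            for (std::size_t i = b; i < e; ++i) v[i] *= 2.0;
        }};
    std::vector<double> x(1000, 1.0);
    Runtime{}.run(scale, x);  // host receives 1/4 of the work
    return 0;
}
```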
