HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high-performance library that incorporates the developments presented here and, more broadly, provides DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is scheduled over the heterogeneous hardware so as to minimize data movement and to map algorithmic requirements to the architectural strengths of the various hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which shields the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
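
To make the hybridization idea concrete, the following is a minimal sketch of how a blocked LU factorization can be split into tasks of two granularities: a small panel factorization and a large trailing-matrix update. It is an illustration under stated assumptions, not the MAGMA MIC API: the function names (`panel_factor`, `trailing_update`, `hybrid_lu`), the block size `nb`, and the omission of pivoting are all choices made for this example, and both task types simply run on the host here. In a hybrid scheme of the kind described above, the panel task would typically stay on the CPU while the update task would be offloaded to the coprocessor.

```python
# Illustrative sketch only: blocked right-looking LU without pivoting,
# split into fine-grained "panel" tasks and coarse-grained "update" tasks.
# All names are hypothetical and do not come from MAGMA MIC.

def panel_factor(A, k, nb):
    """Small, latency-bound task: factor columns k..k+nb-1 in place.
    A hybrid scheme would typically keep this on the multicore CPU."""
    n = len(A)
    end = min(k + nb, n)
    for j in range(k, end):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                 # multiplier l_ij
            for c in range(j + 1, end):
                A[i][c] -= A[i][j] * A[j][c]   # update inside the panel

def trailing_update(A, k, nb):
    """Large, compute-bound task: update everything right of the panel.
    A hybrid scheme would typically offload this to the coprocessor."""
    n = len(A)
    end = min(k + nb, n)
    # Triangular solve with the unit lower-triangular panel block.
    for j in range(k, end):
        for i in range(j + 1, end):
            for c in range(end, n):
                A[i][c] -= A[i][j] * A[j][c]
    # Matrix-matrix multiply on the trailing submatrix.
    for i in range(end, n):
        for c in range(end, n):
            A[i][c] -= sum(A[i][j] * A[j][c] for j in range(k, end))

def hybrid_lu(A, nb=2):
    """Return the packed LU factors of A (L below the diagonal, U on and above)."""
    LU = [row[:] for row in A]
    for k in range(0, len(LU), nb):
        panel_factor(LU, k, nb)      # fine-grained task
        trailing_update(LU, k, nb)   # coarse-grained task
    return LU

def reconstruct(LU):
    """Multiply the packed factors back together to check the result."""
    n = len(LU)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else LU[i][p]
                out[i][j] += l_ip * LU[p][j]
    return out
```

Calling `hybrid_lu(A)` on a small diagonally dominant matrix and passing the result to `reconstruct` should reproduce `A` up to rounding error. Because pivoting is omitted, the sketch is only safe for matrices that do not need it; a production solver would use partial pivoting as LAPACK does.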
