Accelerator-Centered Programming on Heterogeneous Systems

Heterogeneous architectures pair general-purpose CPUs with parallel many-core processors to achieve high computational throughput. Attached as coprocessors over PCIe, these special-purpose cores typically serve as floating-point accelerators (ACCs). Popular programming models offload the compute-intensive parts to the accelerator and then aggregate the results, which incurs a large volume of data transfer over PCIe. In this paper, we introduce an ACC-centered model that copes with the limited PCIe bandwidth, increases performance, and reduces accelerator idle time. To realize near-data computing, the ACC-centered model structures the program around the accelerator and offloads the control-intensive parts to the CPU, so that both the CPU and the ACC contribute to performance according to their architectural strengths. Validation on the Tianhe-2 supercomputer shows that our ACC-centered LU implementation is competitive with the highly optimized Intel MKL hybrid implementation and achieves about a 5× speedup over the CPU-only version.
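The PCIe-traffic argument behind the ACC-centered model can be illustrated with a toy accounting sketch. The model below is purely illustrative (it is not the paper's implementation, and the cost formulas are simplified assumptions): in a conventional CPU-centered blocked LU, each step ships the trailing submatrix to the accelerator for the update and copies it back, whereas in the ACC-centered style the matrix stays resident on the accelerator and only the narrow panel crosses PCIe so the CPU can handle the control-intensive pivoting work.

```python
# Toy accounting of PCIe traffic for two offload styles of blocked LU
# on an n x n double-precision matrix with block size nb.
# Hypothetical cost model for illustration only.

def cpu_centered_traffic(n, nb):
    """CPU holds the matrix; each step ships the trailing submatrix
    to the accelerator for the update and copies the result back."""
    total = 0
    k = 0
    while k < n:
        trailing = n - k - nb
        if trailing > 0:
            total += 2 * trailing * trailing * 8  # bytes to ACC and back
        k += nb
    return total

def acc_centered_traffic(n, nb):
    """Matrix stays resident on the accelerator; only the nb-wide
    panel crosses PCIe so the CPU can do the pivoting/control work."""
    total = 0
    k = 0
    while k < n:
        rows = n - k
        total += 2 * rows * nb * 8  # panel bytes to CPU and back
        k += nb
    return total

if __name__ == "__main__":
    n, nb = 4096, 256
    cpu_style = cpu_centered_traffic(n, nb)
    acc_style = acc_centered_traffic(n, nb)
    print(f"CPU-centered: {cpu_style / 1e9:.2f} GB over PCIe")
    print(f"ACC-centered: {acc_style / 1e9:.2f} GB over PCIe")
    print(f"reduction:    {cpu_style / acc_style:.1f}x")
```

Even under this crude model, keeping the large trailing matrix resident on the accelerator and moving only panels cuts PCIe traffic by an order of magnitude for realistic problem sizes, which is the intuition the ACC-centered design exploits.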
