LU factorization on heterogeneous systems: an energy-efficient approach towards high performance
[1] Massimiliano Fatica. Accelerating linpack with CUDA on heterogenous clusters, 2009, GPGPU-2.
[2] Hyesoon Kim, et al. An integrated GPU power and performance model, 2010, ISCA.
[3] Jack Dongarra, et al. Numerical Linear Algebra for High-Performance Computers, 1998.
[4] Volker Lindenstruth, et al. Optimized HPL for AMD GPU and multi-core CPU usage, 2011, Computer Science - Research and Development.
[5] Xiaoming Zhang, et al. Hybrid hierarchy storage system in MilkyWay-2 supercomputer, 2014, Frontiers of Computer Science.
[6] Pradeep Dubey, et al. Designing and dynamically load balancing hybrid LU for multi/many-core, 2011, Computer Science - Research and Development.
[7] Jack J. Dongarra, et al. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi, 2013, PPAM.
[8] P. G. Hipes, et al. Gauss-Jordan inversion with pivoting on the Caltech Mark II hypercube, 1989, C3P.
[9] Eric F. van de Velde, et al. Experiments with Multicomputer LU-decomposition, 1990, Concurr. Pract. Exp.
[10] Jack J. Dongarra, et al. LU Factorization with Partial Pivoting for a Multicore System with Accelerators, 2013, IEEE Transactions on Parallel and Distributed Systems.
[11] John A. Gunnels, et al. Petascale computing with accelerators, 2009, PPoPP '09.
[12] Zizhong Chen, et al. A survey of power and energy efficient techniques for high performance numerical linear algebra operations, 2014, Parallel Comput.
[13] Laurent Albera, et al. Joint Eigenvalue Decomposition of Non-Defective Matrices Based on the LU Factorization With Application to ICA, 2015, IEEE Transactions on Signal Processing.
[14] Canqun Yang, et al. Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer, 2011, Journal of Computer Science and Technology.
[15] Jack J. Dongarra, et al. Optimization for performance and energy for batched matrix computations on GPUs, 2015, GPGPU@PPoPP.
[16] Enrique S. Quintana-Ortí, et al. Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms, 2012, 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications.
[17] Jian Li, et al. Power-efficient time-sensitive mapping in heterogeneous systems, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[18] G. C. Fox, et al. Solving Problems on Concurrent Processors, 1988.
[19] Jack J. Dongarra, et al. A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations, 2015, ISC.
[20] Jungwon Kim, et al. Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes, 2015, IEEE Transactions on Parallel and Distributed Systems.
[21] Mark A. Johnson, et al. Solving problems on concurrent processors. Vol. 1: General techniques and regular problems, 1988.
[22] Stephen A. Jarvis, et al. Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units, 2015, 2015 44th International Conference on Parallel Processing.
[23] Xuhao Chen, et al. Adaptive Cache Management for Energy-Efficient GPU Computing, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[24] Satoshi Matsuoka, et al. Linpack evaluation on a supercomputer with heterogeneous accelerators, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[25] Fred G. Gustavson, et al. Recursion leads to automatic variable blocking for dense linear-algebra algorithms, 1997, IBM J. Res. Dev.
[26] Pradeep Dubey, et al. Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.