Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
Piotr Luszczek | Jakub Kurzak | Jack Dongarra | Mark Gates | Yaohung M. Tsai
[1] Thomas K. Gaylord, et al. Rigorous coupled-wave analysis of metallic surface-relief gratings, 1986.
[2] Jack J. Dongarra, et al. Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs, 2016, IEEE Transactions on Parallel and Distributed Systems.
[3] Keshav Pingali, et al. Look Left, Look Right, Look Left Again: An Application of Fractal Symbolic Analysis to Linear Algebra Code Restructuring, 2004, International Journal of Parallel Programming.
[4] Shoaib Ashraf Kamil, et al. Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages, 2012.
[5] Jack J. Dongarra, et al. Automated empirical optimizations of software and the ATLAS project, 2001, Parallel Comput.
[6] F. Franchetti, et al. Automatic Application Tuning for HPC Architectures, 2014.
[7] Michael T. Heath, et al. High-performance hybrid CPU and GPU parallel algorithm for digital volume correlation, 2015, Int. J. High Perform. Comput. Appl.
[8] Jack J. Dongarra, et al. Search Space Generation and Pruning System for Autotuners, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[9] Anamitra R. Choudhury, et al. Multifrontal Factorization of Sparse SPD Matrices on GPUs, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[10] Walter F. Tichy, et al. Atune-IL: An Instrumentation Language for Auto-tuning Parallel Applications, 2009, Euro-Par.
[11] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[12] Jack J. Dongarra, et al. Towards batched linear solvers on accelerated hardware platforms, 2015, PPOPP.
[13] Hugh Alan Bruck, et al. Digital image correlation using Newton-Raphson method of partial differential correction, 1989.
[14] Steven G. Johnson, et al. The Design and Implementation of FFTW3, 2005, Proceedings of the IEEE.
[15] Jack J. Dongarra, et al. Acceleration of GPU-based Krylov solvers via data transfer reduction, 2015, Int. J. High Perform. Comput. Appl.
[16] Jack J. Dongarra, et al. Batched matrix computations on hardware accelerators based on GPUs, 2015, Int. J. High Perform. Comput. Appl.
[17] Jack Dongarra, et al. Sparse direct solvers with accelerators over DAG runtimes, 2012.
[18] James Demmel, et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, 1997, ICS '97.
[19] Jack J. Dongarra, et al. Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[20] Jack J. Dongarra, et al. Accelerating collaborative filtering using concepts from high performance computing, 2015, 2015 IEEE International Conference on Big Data (Big Data).
[21] Massimiliano Fatica, et al. Power/Performance Trade-Offs of Small Batched LU Based Solvers on GPUs, 2013, Euro-Par.
[22] Katherine Yelick, et al. OSKI: A library of automatically tuned sparse matrix kernels, 2005.
[23] Keshav Pingali, et al. Fractal symbolic analysis, 2000, TOPL.
[24] Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.
[25] Jack Dongarra, et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, 2009.
[26] Richard Veras, et al. Capturing the Expert: Generating Fast Matrix-Multiply Kernels with Spiral, 2014, VECPAR.
[27] Jack J. Dongarra, et al. Dense linear algebra solvers for multicore with GPU accelerators, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[28] Ken Kennedy, et al. Automatic blocking of QR and LU factorizations for locality, 2004, MSP '04.
[29] Yifan Hu, et al. Collaborative Filtering for Implicit Feedback Datasets, 2008, 2008 Eighth IEEE International Conference on Data Mining.
[30] Jack J. Dongarra, et al. Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks, 2016, 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[31] Jack J. Dongarra, et al. LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU, 2014, 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS).
[32] Franz Franchetti, et al. SPIRAL: Code Generation for DSP Transforms, 2005, Proceedings of the IEEE.
[33] Jack J. Dongarra, et al. A Fast Batched Cholesky Factorization on a GPU, 2014, 2014 43rd International Conference on Parallel Processing.
[34] Viktor K. Prasanna, et al. Tiling, Block Data Layout, and Memory Hierarchy Performance, 2003, IEEE Trans. Parallel Distributed Syst.
[35] Franz Franchetti, et al. FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).