CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs
Nectarios Koziris | Georgios I. Goumas | Nikela Papadopoulou | Petros Anastasiadis
[1] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..
[2] Torsten Hoefler,et al. LogGP in theory and practice - An in-depth analysis of modern interconnection networks and benchmarking methods for collective operations , 2009, Simul. Model. Pract. Theory.
[3] José Ignacio Benavides Benítez,et al. Performance models for asynchronous data transfers on consumer Graphics Processing Units , 2012, J. Parallel Distributed Comput..
[4] Simon See,et al. An Evaluation of Unified Memory Technology on NVIDIA GPUs , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[5] Jack J. Dongarra,et al. Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard (1) , 2002, Int. J. High Perform. Comput. Appl..
[6] Robert A. van de Geijn,et al. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks , 2008, PPoPP.
[7] Hal Finkel,et al. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading , 2017, LLVM-HPC@SC.
[8] Yannis Cotronis,et al. A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling , 2017, J. Parallel Distributed Comput..
[9] Hiroyuki Sato,et al. Linear Performance-Breakdown Model: A Framework for GPU kernel programs performance analysis , 2015, Int. J. Netw. Comput..
[10] Javier Cuenca,et al. Tuning basic Linear Algebra Routines for Hybrid CPU+GPU Platforms , 2014, ICCS.
[11] Kim M. Hazelwood,et al. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer , 2011, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[12] Jason Maassen,et al. Performance Models for CPU-GPU Data Transfers , 2014, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[13] David R. Kaeli,et al. Exploring the multiple-GPU design space , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[14] Yi Yang,et al. BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing , 2015, ICS.
[15] Jiayuan Meng,et al. Improving GPU Performance Prediction with Data Transfer Modeling , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[16] Thierry Gautier,et al. XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server , 2020, 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP).
[17] George Bosilca,et al. Hierarchical DAG Scheduling for Hybrid Distributed Systems , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[18] Zheng Gong,et al. Software pipelining for graphic processing unit acceleration: Partition, scheduling and granularity , 2016, Int. J. High Perform. Comput. Appl..
[19] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[20] Venkatram Vishwanath,et al. GROPHECY: GPU performance projection from CPU code skeletons , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[21] Torsten Hoefler,et al. Performance modeling for systematic performance tuning , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[22] Jinjun Xiong,et al. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects , 2019, ICPE.
[23] Jack Dongarra,et al. Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs , 2010.
[24] Mahmoud Naghibzadeh,et al. Comparison of analytical and ML-based models for predicting CPU–GPU data transfer time , 2020, Computing.
[25] Scott B. Baden,et al. Modeling and predicting performance of high performance computing applications on hardware accelerators , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[26] Jack J. Dongarra,et al. Towards dense linear algebra for hybrid GPU accelerated manycore systems , 2009, Parallel Comput..
[27] Paolo Bientinesi,et al. Performance Modeling for Dense Linear Algebra , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[28] Yao Zhang,et al. A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[29] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.