GPU code optimization using abstract kernel emulation and sensitivity analysis
Sriram Krishnamoorthy | P. Sadayappan | Aravind Sukumaran-Rajam | Fabrice Rastello | Jinsung Kim | Louis-Noël Pouchet | Prashant Singh Rawat | Changwan Hong
[1] Mingyu Chen, et al. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, 2017, PPoPP.
[2] Apan Qasem, et al. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality, 2012, CC.
[3] Sriram Krishnamoorthy, et al. Optimizing tensor contraction expressions for hybrid CPU-GPU execution, 2013, Cluster Computing.
[4] Albert Cohen, et al. Hybrid Hexagonal/Classical Tiling for GPUs, 2014, CGO '14.
[5] Michael F. P. O'Boyle, et al. Automatic optimization of thread-coarsening for graphics processors, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[6] Jianliang Xu, et al. GPURoofline: A Model for Guiding Performance Optimizations on GPUs, 2012, Euro-Par.
[7] Richard W. Vuduc, et al. A performance analysis framework for identifying potential benefits in GPGPU applications, 2012, PPoPP '12.
[8] André Seznec, et al. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[9] P. Sadayappan, et al. Resource conscious reuse-driven tiling for GPUs, 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[10] Christian Terboven, et al. OpenACC - First Experiences with Real-World Applications, 2012, Euro-Par.
[11] Mike O'Connor, et al. Cache-Conscious Wavefront Scheduling, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[12] Paulius Micikevicius, et al. Fusing convolution kernels through tiling, 2015, ARRAY@PLDI.
[13] Jack Dongarra, et al. Report on the Sunway TaihuLight System, 2016.
[14] Yao Zhang, et al. A quantitative performance analysis model for GPU architectures, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[15] Guangwen Yang, et al. Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[16] Tjerk P. Straatsma, et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations, 2010, Comput. Phys. Commun.
[17] GPU code optimization using abstract kernel emulation and sensitivity analysis, 2018, PLDI.
[18] J. Ramanujam, et al. Automatic C-to-CUDA Code Generation for Affine Programs, 2010, CC.
[19] Uday Bondhugula, et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.
[20] Helmar Burkhart, et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[21] L. Dagum, et al. OpenMP: an industry standard API for shared-memory programming, 1998.
[22] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[23] Francky Catthoor, et al. Polyhedral parallel code generation for CUDA, 2013, TACO.
[24] William Gropp, et al. An adaptive performance modeling tool for GPU architectures, 2010, PPoPP '10.
[25] P. Sadayappan, et al. High-performance code generation for stencil computations on GPU architectures, 2012, ICS '12.
[26] Mark N. Wegman, et al. Constant propagation with conditional branches, 1985, POPL.
[27] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.
[28] Sriram Krishnamoorthy, et al. Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters, 2010, 2010 IEEE International Conference on Cluster Computing.
[29] Albert Cohen, et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles, 2013, GPGPU@ASPLOS.
[30] Seyong Lee, et al. OpenARC: open accelerator research compiler for directive-based, efficient heterogeneous computing, 2014, HPDC '14.
[31] Isaiah Shavitt, et al. Many-Body Methods in Chemistry and Physics: MBPT and Coupled-Cluster Theory, 2009.
[32] Michael F. P. O'Boyle, et al. A large-scale cross-architecture evaluation of thread-coarsening, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[33] Seyong Lee, et al. COMPASS: A Framework for Automated Performance Modeling and Prediction, 2015, ICS.
[34] Chaowei Wang, et al. A performance analysis framework for exploiting GPU microarchitectural capability, 2017, ICS '17.
[35] Shoaib Kamil, et al. OpenTuner: An extensible framework for program autotuning, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).