Using hardware performance counters to speed up autotuning convergence on GPUs
[1] Margaret Martonosi, et al. Starchart: Hardware and software optimization using recursive partitioning regression trees, 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[2] Chris Cummins, et al. End-to-End Deep Learning of Optimization Heuristics, 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] Yannis Cotronis, et al. A Practical Performance Model for Compute and Memory Bound GPU Kernels, 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[4] Derek Chiou, et al. GPGPU performance and power estimation using machine learning, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[5] Henk Corporaal, et al. Transit: A Visual Analytical Model for Multithreaded Machines, 2015, HPDC.
[6] José María Carazo, et al. A GPU acceleration of 3-D Fourier reconstruction in cryo-EM, 2019, Int. J. High Perform. Comput. Appl.
[7] Jan Fousek, et al. Exploiting historical data: Pruning autotuning spaces and estimating the number of tuning steps, 2020, Concurr. Comput. Pract. Exp.
[8] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[9] Apan Qasem, et al. Maximizing Hardware Prefetch Effectiveness with Machine Learning, 2015, 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.
[10] Tarek S. Abdelrahman, et al. A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[11] Hsien-Hsin S. Lee, et al. GPUMech: GPU Performance Modeling Technique Based on Interval Analysis, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[12] Shoaib Kamil, et al. OpenTuner: An extensible framework for program autotuning, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[13] Sergei Gorlatch, et al. ATF: A generic directive‐based auto‐tuning framework, 2019, Concurr. Comput. Pract. Exp.
[14] Michael F. P. O'Boyle, et al. Rapidly Selecting Good Compiler Optimizations using Performance Counters, 2007, International Symposium on Code Generation and Optimization (CGO'07).
[15] Peng Zhang, et al. Auto-tuning Streamed Applications on Intel Xeon Phi, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[16] Jiri Filipovic, et al. Autotuning of OpenCL Kernels with Global Optimizations, 2017, ANDARE '17.
[17] Simon D. Hammond, et al. Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures, 2017, 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).
[18] Siegfried Benkner, et al. A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit, 2020, Future Gener. Comput. Syst.
[19] Cedric Nugteren, et al. CLTune: A Generic Auto-Tuner for OpenCL Kernels, 2015, 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip.
[20] Vasily Volkov, et al. Understanding Latency Hiding on GPUs, 2016.
[21] Yao Zhang, et al. A quantitative performance analysis model for GPU architectures, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[22] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[23] Amin Nezarat, et al. Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures, 2021, Data in Brief.
[24] Anne C. Elster, et al. Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.
[25] Shiao-Li Tsao, et al. Efficient and Portable Workgroup Size Tuning, 2020, IEEE Transactions on Parallel and Distributed Systems.
[26] Simon McIntosh-Smith, et al. Improving Auto-Tuning Convergence Times with Dynamically Generated Predictive Performance Models, 2015, 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip.
[27] Jack J. Dongarra, et al. A comparison of search heuristics for empirical code optimization, 2008, 2008 IEEE International Conference on Cluster Computing.
[28] Prasanna Balaprakash, et al. Exploiting Performance Portability in Search Algorithms for Autotuning, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[29] Eduardo Cesar Galobardes, et al. Automatic Tuning of HPC Applications. The Periscope Tuning Framework, 2015.
[30] Apan Qasem, et al. Automatically Selecting Profitable Thread Block Sizes for Accelerated Kernels, 2017, 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).
[31] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.
[32] David Cox, et al. Input-Aware Auto-Tuning of Compute-Bound HPC Kernels, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[33] Ben van Werkhoven, et al. Kernel Tuner: A search-optimizing GPU code auto-tuner, 2019, Future Gener. Comput. Syst.
[34] Cees T. A. M. de Laat, et al. The landscape of GPGPU performance modeling tools, 2016, Parallel Comput.
[35] Sally A. McKee, et al. Prediction-based power estimation and scheduling for CMPs, 2009, ICS '09.
[36] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[37] Anna Sikora, et al. Hardware Counters' Space Reduction for Code Region Characterization, 2019, Euro-Par.
[38] Cedric Nugteren, et al. CLBlast: A Tuned OpenCL BLAS Library, 2017, IWOCL.
[39] Prasanna Balaprakash, et al. Can search algorithms save large-scale automatic performance tuning?, 2011, ICCS.