RT-CUDA: A Software Tool for CUDA Code Restructuring
Mayez A. Al-Mouhamed | Muhammed Al-Mulhem | Ayaz ul Hasan Khan | Adel F. Ahmed
[1] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[2] Mark S. Peercy,et al. A performance-oriented data parallel virtual machine for GPUs , 2006, SIGGRAPH '06.
[3] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.
[4] Fumihiko Ino,et al. A code motion technique for accelerating general-purpose computation on the GPU , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[5] David A. Patterson,et al. Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .
[6] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[7] Adrian Jackson,et al. Dynamic Loop Parallelisation , 2012, ArXiv.
[8] Yi Yang,et al. The Implementation of a High Performance GPGPU Compiler , 2012, International Journal of Parallel Programming.
[9] Cedric Nugteren,et al. Improving the Programmability of GPU Architectures , 2014 .
[10] Vincent Rijmen,et al. The Design of Rijndael: AES - The Advanced Encryption Standard , 2002 .
[11] Vincent Rijmen,et al. The Design of Rijndael , 2002, Information Security and Cryptography.
[12] Timothy A. Davis,et al. The University of Florida sparse matrix collection , 2011, TOMS.
[13] Andrei P. Ershov. On programming of arithmetic operations , 1958, CACM.
[14] Volodymyr Kindratenko,et al. Porting Optimized GPU Kernels to a Multi-core CPU: Computational Quantum Chemistry Application Example , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[15] Brian Vinter,et al. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[16] Bronis R. de Supinski,et al. OpenMP for Accelerators , 2011, IWOMP.
[17] Nazeeruddin Mohammad,et al. Erratum to: Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core , 2015, International Journal of Parallel Programming.
[18] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .
[19] Nicholas Wilt,et al. The CUDA Handbook: A Comprehensive Guide to GPU Programming , 2013 .
[20] Guibin Wang,et al. Coordinate strip-mining and kernel fusion to lower power consumption on GPU , 2011, 2011 Design, Automation & Test in Europe.
[21] William J. Dally,et al. A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors , 2012, TOCS.
[22] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[23] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.
[24] Philippas Tsigas,et al. The Synchronization Power of Coalesced Memory Accesses , 2010, IEEE Transactions on Parallel and Distributed Systems.
[25] Chun Chen,et al. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy , 2005, International Symposium on Code Generation and Optimization.
[26] G. D. Peterson,et al. Power Aware Computing on GPUs , 2012, 2012 Symposium on Application Accelerators in High Performance Computing.
[27] Akiyoshi Wakatani. Effectiveness of a strip-mining approach for VQ image coding using GPGPU implementation , 2009, 2009 24th International Conference Image and Vision Computing New Zealand.
[28] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[29] Rudolf Eigenmann,et al. OpenMPC: extended OpenMP for efficient programming and tuning on GPUs , 2013, Int. J. Comput. Sci. Eng..
[30] Mayez A. Al-Mouhamed,et al. Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm , 2014, Int. J. Networked Distributed Comput..
[31] Henk Corporaal,et al. Compile-time GPU memory access optimizations , 2010, 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation.
[32] Benoît Meister,et al. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction , 2010, GPGPU-3.
[33] Hyeran Jeon,et al. Graph processing on GPUs: Where are the bottlenecks? , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[34] Wu-chun Feng,et al. Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[35] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[36] Roy H. Campbell,et al. Plasma: Shared Memory Dynamic Allocation and Bank-Conflict-Free Access in GPUs , 2012, 2012 41st International Conference on Parallel Processing Workshops.
[37] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.
[38] Zhaohui Du,et al. Data and computation transformations for Brook streaming applications on multiprocessors , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[39] Tarek S. Abdelrahman,et al. hiCUDA: a high-level directive-based language for GPU programming , 2009, GPGPU-2.
[40] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.
[41] P. Sadayappan,et al. Optimal loop unrolling for GPGPU programs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).