Understanding Latency Hiding on GPUs
[1] J. Little. A Proof for the Queuing Formula: L = λW , 1961 .
[2] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[3] Edward D. Lazowska,et al. Quantitative system performance - computer system analysis using queueing network models , 1983, Int. CMG Conference.
[4] David E. Culler,et al. Analysis of multithreaded architectures for parallel computing , 1990, SPAA '90.
[5] H. Levy,et al. An architecture for software-controlled data prefetching , 1991, Proceedings of the 18th Annual International Symposium on Computer Architecture.
[6] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.
[7] Anoop Gupta,et al. Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.
[8] Anant Agarwal,et al. Performance Tradeoffs in Multithreaded Processors , 1992, IEEE Trans. Parallel Distributed Syst.
[9] James R. Goodman,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[10] D. Bailey. Little's Law and High Performance Computing , 1997 .
[11] William J. Dally,et al. Principles and Practices of Interconnection Networks , 2004 .
[12] Sally A. McKee,et al. Reflections on the memory wall , 2004, CF '04.
[13] Mark J. Harris. Mapping computational concepts to GPUs , 2005, SIGGRAPH Courses.
[14] Stephen C. Graves,et al. Little's Law , 2008 .
[15] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[16] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[17] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[18] John R. Nickolls,et al. Scalable parallel programming , 2008 .
[19] V. Volkov,et al. Fitting FFT onto the G80 Architecture , 2008 .
[20] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[21] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.
[22] Timo Aila,et al. Understanding the efficiency of ray traversal on GPUs , 2009, High Performance Graphics.
[23] Tor M. Aamodt,et al. A first-order fine-grained multithreaded throughput model , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[24] K. Srinathan,et al. A performance prediction model for the CUDA GPGPU platform , 2009, 2009 International Conference on High Performance Computing (HiPC).
[25] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[26] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.
[27] Avi Mendelson,et al. Threads vs. caches: Modeling the behavior of parallel workloads , 2010, 2010 IEEE International Conference on Computer Design.
[28] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[29] Andrew S. Grimshaw,et al. Revisiting sorting for GPGPU stream architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[30] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.
[31] Andreas Resios. GPU performance prediction using parametrized models , 2011 .
[32] David A. Patterson,et al. Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .
[33] David Simchi-Levi,et al. Introduction to "Little's Law as Viewed on Its 50th Anniversary" , 2011, Oper. Res.
[34] Yao Zhang,et al. A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[35] William J. Dally,et al. Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[36] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[37] André Seznec,et al. Break down GPU execution time with an analytical method , 2012, RAPIDO '12.
[38] Richard W. Vuduc,et al. A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.
[39] Lin Ma,et al. A Memory Access Model for Highly-threaded Many-core Architectures , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.
[40] André Seznec,et al. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[41] Massimiliano Fatica,et al. CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming , 2013 .
[42] Shuaiwen Song,et al. A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[43] Henk Corporaal,et al. A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[44] Hsien-Hsin S. Lee,et al. GPUMech: GPU Performance Modeling Technique Based on Interval Analysis , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[45] Hiroyuki Sato,et al. Linear Performance-Breakdown Model: A Framework for GPU kernel programs performance analysis , 2015, Int. J. Netw. Comput.
[46] Jungwon Kim,et al. A Performance Model for GPUs with Caches , 2015, IEEE Transactions on Parallel and Distributed Systems.
[47] Xinxin Mei,et al. Dissecting GPU Memory Hierarchy Through Microbenchmarking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[48] AngryCalc. NVIDIA GeForce GTX 1050 Ti , 2018 .