[1] Allen D. Malony,et al. Autotuning GPU Kernels via Static and Predictive Analysis , 2017, 2017 46th International Conference on Parallel Processing (ICPP).
[2] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[3] Xinxin Mei,et al. Dissecting GPU Memory Hierarchy Through Microbenchmarking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[4] Wen-mei W. Hwu,et al. Program optimization space pruning for a multithreaded GPU , 2008, CGO '08.
[5] Changbo Chen,et al. Basic Polynomial Algebra Subprograms , 2015, ACCA.
[6] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.
[7] Richard W. Vuduc,et al. A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.
[8] J. Little. A Proof for the Queuing Formula: L = λW , 1961 .
[9] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[10] Vasily Volkov,et al. Understanding Latency Hiding on GPUs , 2016 .
[11] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.
[12] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[13] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[14] Sudhakar Yalamanchili,et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[15] Marc Moreno Maza,et al. A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads , 2014, PARCO.
[16] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.
[17] Jack Dongarra,et al. LAPACK Users' Guide, 3rd ed. , 1999 .
[18] K. Chung,et al. On Lattices Admitting Unique Lagrange Interpolations , 1977 .
[19] Lin Ma,et al. A Memory Access Model for Highly-threaded Many-core Architectures , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.
[20] Phillip B. Gibbons. A more practical PRAM model , 1989, SPAA '89.
[21] Peter J. Olver,et al. On Multivariate Interpolation , 2003 .
[22] Robert M. Corless,et al. A Graduate Introduction to Numerical Methods , 2013 .
[23] Hsien-Hsin S. Lee,et al. GPUMech: GPU Performance Modeling Technique Based on Interval Analysis , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[24] Steven G. Johnson,et al. FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).
[25] Vasily Volkov. A microbenchmark to study GPU performance models , 2018, PPOPP.
[26] H. Hong. An improvement of the projection operator in cylindrical algebraic decomposition , 1990, ISSAC '90.
[27] Bernhard Beckermann,et al. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices , 2000, Numerische Mathematik.
[28] Todd J. Martínez,et al. Automated Code Engine for Graphical Processing Units: Application to the Effective Core Potential Integrals and Gradients , 2016, Journal of Chemical Theory and Computation.
[29] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[30] Alex Brandt,et al. High Performance Sparse Multivariate Polynomials: Fundamental Data Structures and Algorithms , 2018 .
[31] Michael Franz,et al. Continuous program optimization: A case study , 2003, TOPL.
[32] José M. F. Moura,et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms , 2004, Int. J. High Perform. Comput. Appl..
[33] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[34] Uzi Vishkin,et al. Simulation of Parallel Random Access Machines by Circuits , 1984, SIAM J. Comput..
[35] D. Eisenbud. Commutative Algebra: with a View Toward Algebraic Geometry , 1995 .