Program Optimization Strategies for Data-Parallel Many-Core Processors
[1] Mikhail J. Atallah,et al. Algorithms and Theory of Computation Handbook , 2009, Chapman & Hall/CRC Applied Algorithms and Data Structures series.
[2] Mary Lou Soffa,et al. An approach to ordering optimizing transformations , 1990, PPOPP '90.
[3] Herbert H. J. Hum,et al. Compilation, architectural support, and evaluation of SIMD graphics pipeline programs on a general-purpose CPU , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[4] Steven R. Vegdahl. Phase coupling and constant generation in an optimizing microcode compiler , 1982, MICRO 15.
[5] Daniel Jiménez-González,et al. Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.
[6] Y. N. Srikant,et al. Microarchitecture Sensitive Empirical Models for Compiler Optimizations , 2007, International Symposium on Code Generation and Optimization (CGO'07).
[7] Barbara M. Chapman,et al. Supercompilers for parallel and vector computers , 1990, ACM Press frontier series.
[8] S. Asano,et al. The design and implementation of a first-generation CELL processor , 2005, ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005..
[9] Milind Girkar,et al. EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system , 2007, PLDI '07.
[10] Keith D. Cooper,et al. Order Matters: Exploring the Structure of the Space of Compilation Sequences Using Randomized Search Algorithms , 2004 .
[11] Peter M. W. Knijnenburg,et al. Automatic selection of compiler options using non-parametric inferential statistics , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[12] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[13] David I. August,et al. Compiler optimization-space exploration , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..
[14] Sharad Malik,et al. Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.
[15] Lawrence Rauchwerger,et al. Polaris: The Next Generation in Parallelizing Compilers , 2000 .
[16] David Tarditi,et al. Accelerator: using data parallelism to program GPUs for general-purpose uses , 2006, ASPLOS XII.
[17] Michael Wolfe,et al. Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.
[18] D.A. Reed,et al. An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[19] Kenneth E. Iverson,et al. A programming language , 1962, AIEE-IRE '62 (Spring).
[20] Ken Kennedy,et al. Optimizing for parallelism and data locality , 1992, ICS '92.
[21] Klaus Schulten,et al. GPU acceleration of cutoff pair potentials for molecular modeling applications , 2008, CF '08.
[22] Ken Kennedy,et al. Improving register allocation for subscripted variables , 1990, PLDI '90.
[23] Ken Kennedy,et al. PFC: A Program to Convert Fortran to Parallel Form , 1982 .
[24] Ken Kennedy,et al. Automatic translation of FORTRAN programs to vector form , 1987, TOPL.
[25] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[26] Michael E. Wolf,et al. Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.
[27] Sarita V. Adve,et al. Code transformations to improve memory parallelism , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
[28] Wen-mei W. Hwu,et al. A systematic approach to delivering instruction-level parallelism in epic systems , 2005 .
[29] Marc Snir,et al. Automatic tuning matrix multiplication performance on graphics hardware , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[30] Brian Fahs,et al. Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[31] N.K. Govindaraju,et al. A Memory Model for Scientific Algorithms on Graphics Processors , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[32] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .
[33] Chau-Wen Tseng,et al. Compiler optimizations for improving data locality , 1994, ASPLOS VI.
[34] Hsueh-Ming Hang,et al. Motion Estimation for Video Coding Standards , 1997, J. VLSI Signal Process..
[35] William R. Mark,et al. Cg: a system for programming graphics hardware in a C-like language , 2003, ACM Trans. Graph..
[36] Jerrold L. Wagener,et al. Fortran 90 Handbook: Complete ANSI/ISO Reference , 1992 .
[37] Wen-mei W. Hwu,et al. Program optimization space pruning for a multithreaded gpu , 2008, CGO '08.
[38] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[39] Chau-Wen Tseng,et al. Software Support For Improving Locality in Scientific Codes , 2001 .
[40] David Kirk,et al. NVIDIA cuda software and gpu parallel computing architecture , 2007, ISMM '07.
[41] Wen-mei W. Hwu,et al. Field-testing IMPACT EPIC research results in Itanium 2 , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[42] Monica S. Lam,et al. Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..
[43] François Bodin,et al. Improving cache behavior of dynamically allocated data structures , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).
[44] José M. F. Moura,et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms , 2004, Int. J. High Perform. Comput. Appl..
[45] Ken Kennedy,et al. Automatic decomposition of scientific programs for parallel execution , 1987, POPL '87.
[46] John L. Klepeis,et al. Anton, a special-purpose machine for molecular dynamics simulation , 2007, ISCA '07.
[47] Vivek Sarkar,et al. A general framework for iteration-reordering loop transformations , 1992, PLDI '92.
[48] Zhaohui Du,et al. Data and computation transformations for Brook streaming applications on multiprocessors , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[49] Pat Hanrahan,et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication , 2004, Graphics Hardware.
[50] Uday Bondhugula,et al. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories , 2008, PPoPP.
[51] Michael F. P. O'Boyle,et al. Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[52] P. Slusallek,et al. RPU: a programmable ray processing unit for realtime ray tracing , 2005, SIGGRAPH '05.
[53] L. Almagor,et al. Finding effective compilation sequences , 2004, LCTES '04.
[54] Michael Metcalf,et al. High performance Fortran , 1995 .
[55] Mary Lou Soffa,et al. Predicting the impact of optimizations for embedded systems , 2003, LCTES '03.
[56] Ken Kennedy,et al. The memory bandwidth bottleneck and its amelioration by a compiler , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[57] Rajiv Gupta,et al. Register Pressure Sensitive Redundancy Elimination , 1999, CC.
[58] Keith D. Cooper,et al. Combining analyses, combining optimizations , 1995, TOPL.
[59] Michael D. McCool,et al. Performance evaluation of GPUs using the RapidMind development platform , 2006, SC.
[60] Justin P. Haldar,et al. Accelerating advanced mri reconstructions on gpus , 2008, CF '08.
[61] Alexander V. Veidenbaum,et al. EFFECTS OF PROGRAM RESTRUCTURING, ALGORITHM CHANGE, AND ARCHITECTURE CHOICE ON PROGRAM PERFORMANCE. , 1984 .
[62] D. Burger,et al. Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).
[63] Klaus Schulten,et al. Accelerating Molecular Modeling Applications with GPU Computing , 2009 .
[64] Ken Kennedy,et al. Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[65] Ken Kennedy,et al. Automatic data layout for distributed-memory machines , 1998, TOPL.
[66] Vikram S. Adve,et al. Compiler Support for Analysis and Tuning Data Parallel Programs , 1995 .
[67] Michael F. P. O'Boyle,et al. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation , 2004, The Journal of Supercomputing.
[68] François Irigoin,et al. Supernode partitioning , 1988, POPL '88.
[69] James R. Larus,et al. Cache-conscious structure definition , 1999, PLDI '99.
[70] Gary S. Tyson,et al. Evaluating Heuristic Optimization Phase Order Search Algorithms , 2007, International Symposium on Code Generation and Optimization (CGO'07).
[71] Keith D. Cooper,et al. Optimizing for reduced code space using genetic algorithms , 1999, LCTES '99.
[72] Jens H. Krüger,et al. A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.
[73] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..
[74] Ken Kennedy,et al. Bandwidth-Based Performance Tuning and Prediction , 1999 .