Compiler Optimizations and Autotuning for Stencils and Geometric Multigrid
[1] Samuel Williams,et al. Compiler-Directed Transformation for Higher-Order Stencils , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[2] Samuel Williams,et al. Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[3] Guy L. Steele,et al. Fortran at ten gigaflops: the connection machine convolution compiler , 1991, PLDI '91.
[4] L. Collatz. The numerical treatment of differential equations , 1961 .
[5] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[6] I. S. Gradshtein,et al. THE ELEMENTS OF MATHEMATICAL LOGIC , 1963 .
[7] Catherine Mills Olschanowsky,et al. A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] Zhiyuan Li,et al. New tiling techniques to improve cache temporal locality , 1999, PLDI '99.
[9] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[10] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[11] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[12] Bradley C. Kuszmaul,et al. The pochoir stencil compiler , 2011, SPAA '11.
[13] Franz Franchetti,et al. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures , 2011, CC.
[14] Mary W. Hall,et al. CHiLL: A Framework for Composing High-Level Loop Transformations , 2007 .
[15] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[16] Siddhartha Chatterjee,et al. Cache-Efficient Multigrid Algorithms , 2001, Int. J. High Perform. Comput. Appl..
[17] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[18] Chun Chen,et al. Polyhedra scanning revisited , 2012, PLDI.
[19] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[20] Mary W. Hall,et al. Towards making autotuning mainstream , 2013, Int. J. High Perform. Comput. Appl..
[21] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.
[22] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.
[23] Samuel Williams,et al. Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid , 2014 .
[24] Steven J. Deitz,et al. Eliminating redundancies in sum-of-product array computations , 2001, ICS '01.
[25] Jie Cheng,et al. Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..
[26] J. Shalf,et al. Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[27] François Irigoin,et al. Supernode partitioning , 1988, POPL '88.
[28] Samuel Williams,et al. The potential of the cell processor for scientific computing , 2005, CF '06.
[29] Samuel Williams,et al. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures , 2008 .
[30] Paul Feautrier,et al. Dataflow analysis of array and scalar references , 1991, International Journal of Parallel Programming.
[31] José María Cela,et al. Introducing the Semi-stencil Algorithm , 2009, PPAM.
[32] Samuel Williams,et al. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..
[33] Jan Treibig,et al. Efficiency improvements of iterative numerical algorithms on modern architectures , 2008 .
[34] John D. McCalpin,et al. Time Skewing: A Value-Based Approach to Optimizing for Memory Locality , 1999 .
[35] Jason Cong,et al. Polyhedral-based data reuse optimization for configurable computing , 2013, FPGA '13.
[36] Yi Guo,et al. The habanero multicore software research project , 2009, OOPSLA Companion.
[37] Phillip Colella,et al. A Fourth-Order Accurate Finite-Volume Method with Structured Adaptive Mesh Refinement for Solving the Advection-Diffusion Equation , 2012, SIAM J. Sci. Comput..
[38] G. Wellein,et al. Introducing a parallel cache oblivious blocking approach for the lattice Boltzmann method , 2008 .
[39] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[40] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[41] Wim Vanroose,et al. Improving the arithmetic intensity of multigrid with the help of polynomial smoothers , 2012, Numer. Linear Algebra Appl..
[42] Naoya Maruyama,et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .
[43] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[44] David G. Wonnacott,et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[45] Samuel Williams,et al. Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[46] Gerhard Wellein,et al. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.
[47] Xing Zhou,et al. Hierarchical overlapped tiling , 2012, CGO '12.
[48] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[49] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[50] Samuel Williams,et al. Lattice Boltzmann simulation optimization on leading multicore platforms , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[51] Alan Edelman,et al. Autotuning multigrid with PetaBricks , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[52] Ulrich Rüde,et al. Cache Optimization for Structured and Unstructured Grid Multigrid , 2000 .
[53] Samuel Williams,et al. Compiler generation and autotuning of communication-avoiding operators for geometric multigrid , 2013, 20th Annual International Conference on High Performance Computing.
[54] Chun Chen,et al. Loop Transformation Recipes for Code Generation and Auto-Tuning , 2009, LCPC.
[55] Ken Kennedy,et al. Optimizing for parallelism and data locality , 1992, ICS '92.
[56] William Pugh,et al. The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[57] J. Ramanujam,et al. A framework for enhancing data reuse via associative reordering , 2014, PLDI.
[58] Frank Mueller,et al. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters , 2012, CGO '12.