Generating and auto-tuning parallel stencil codes
[1] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] Palash Sarkar,et al. A brief history of cellular automata , 2000, CSUR.
[3] Piet Hut,et al. A hierarchical O(N log N) force-calculation algorithm , 1986, Nature.
[4] David E. Keyes,et al. Exaflop/s: The why and the how , 2011 .
[5] James Demmel,et al. Minimizing Communication for Eigenproblems and the Singular Value Decomposition , 2010, ArXiv.
[6] David G. Wonnacott,et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[7] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[8] Andrew V. Goldberg,et al. PHAST: Hardware-Accelerated Shortest Path Trees , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[9] Steven G. Johnson,et al. The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.
[10] Robert W. Numrich,et al. Co-array Fortran for parallel programming , 1998, FORF.
[11] Zeljko Vujaskovic,et al. Re-setting the biologic rationale for thermal therapy , 2005, International journal of hyperthermia : the official journal of European Society for Hyperthermic Oncology, North American Hyperthermia Group.
[12] Rob H. Bisseling,et al. Cache-Oblivious Sparse Matrix-Vector Multiplication by Using Sparse Matrix Partitioning Methods , 2009, SIAM J. Sci. Comput..
[13] Jack Dongarra,et al. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .
[14] Ulrich Rüde,et al. A framework that supports in writing performance-optimized stencil-based codes , 2010 .
[15] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[16] Franz Franchetti,et al. SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.
[17] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[18] Ken Kennedy,et al. Practical dependence testing , 1991, PLDI '91.
[19] Monica S. Lam,et al. Efficient and exact data dependence analysis , 1991, PLDI '91.
[20] Matthias Müller-Hannemann,et al. Algorithm Engineering: Bridging the Gap between Algorithm Theory and Practice [outcome of a Dagstuhl Seminar] , 2010, Algorithm Engineering.
[21] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[22] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[23] Kevin M. Lepak,et al. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.
[24] John A. Nelder,et al. A Simplex Method for Function Minimization , 1965, Comput. J..
[25] James Demmel,et al. Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..
[26] Samuel Williams,et al. Automatic Thread-Level Parallelization in the Chombo AMR Library , 2011 .
[27] Jonathan Walpole,et al. Is Parallel Programming Hard, And If So, Why? , 2009 .
[28] Martin Griebl,et al. Code generation in the polytope model , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).
[29] Esra Neufeld,et al. Manycore Stencil Computations in Hyperthermia Applications , 2010, Scientific Computing with Multicore and Accelerators.
[30] H. H. Pennes. Analysis of tissue and arterial blood temperatures in the resting human forearm , 1948 .
[31] E. Neufeld,et al. The HYPERcollar: A novel applicator for hyperthermia in the head and neck , 2007 .
[32] George Karypis,et al. Introduction to Parallel Computing Solution Manual , 2003 .
[33] Margarethus M. Paulides,et al. A head and neck hyperthermia applicator: Theoretical antenna array design , 2007, International journal of hyperthermia : the official journal of European Society for Hyperthermic Oncology, North American Hyperthermia Group.
[34] Jean-Pierre Berenger,et al. A perfectly matched layer for the absorption of electromagnetic waves , 1994 .
[35] Samuel Williams,et al. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..
[36] Sanjit A. Seshia,et al. Sketching stencils , 2007, PLDI '07.
[37] Hans-Peter Seidel,et al. Cache oblivious parallelograms in iterative stencil computations , 2010, ICS '10.
[38] Rudolf Eigenmann,et al. PEAK—a fast and effective performance tuning system via compiler optimization orchestration , 2008, TOPL.
[39] Rudolf Eigenmann,et al. Programming Distributed Memory Systems Using OpenMP , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[40] William J. Dally,et al. A tuning framework for software-managed memory hierarchies , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[41] Kenneth A. Hawick,et al. Auto-generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[42] Mitsuhisa Sato,et al. The Omni OpenMP Compiler on the Distributed Shared Memory of Cenju-4 , 2001, WOMPAT.
[43] Bradford L. Chamberlain,et al. Parallel Programmability and the Chapel Language , 2007, Int. J. High Perform. Comput. Appl..
[44] Michael Garland,et al. Efficient Sparse Matrix-Vector Multiplication on CUDA , 2008 .
[45] James Demmel,et al. Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[46] Christian Blum,et al. Metaheuristics in combinatorial optimization: Overview and conceptual comparison , 2003, CSUR.
[47] Zhiyuan Li,et al. Automatic tiling of iterative stencil loops , 2004, TOPL.
[48] Michael E. Wolf,et al. Improving locality and parallelism in nested loops , 1992 .
[49] C. J. Price,et al. A hybrid Hooke and Jeeves–Direct method for non-smooth optimization , 2009 .
[50] Vivek Sarkar,et al. Report on the Experimental Language X10 , 2006 .
[51] Peter Messmer,et al. Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures , 2008 .
[52] E. Neufeld. High Resolution Hyperthermia Treatment Planning , 2008 .
[53] C. Mack,et al. Seeing double , 2021, Nature Catalysis.
[54] Robert J. Fowler,et al. Increasing Temporal Locality with Skewing and Recursive Blocking , 2001, International Conference on Software Composition.
[55] Helmar Burkhart,et al. Automatic code generation and tuning for stencil kernels on modern shared memory architectures , 2011, Computer Science - Research and Development.
[56] Benjamin Hindman,et al. Lithe: enabling efficient composition of parallel libraries , 2009 .
[57] Yousef Saad,et al. Iterative methods for sparse linear systems , 2003 .
[58] Chun Chen,et al. Loop Transformation Recipes for Code Generation and Auto-Tuning , 2009, LCPC.
[59] Jack J. Dongarra,et al. Towards dense linear algebra for hybrid GPU accelerated manycore systems , 2009, Parallel Comput..
[60] J. Tukey,et al. An algorithm for the machine calculation of complex Fourier series , 1965 .
[61] Kenneth Steiglitz,et al. Combinatorial Optimization: Algorithms and Complexity , 1981 .
[62] Edsger W. Dijkstra,et al. A note on two problems in connexion with graphs , 1959, Numerische Mathematik.
[63] Rudolf Eigenmann,et al. OpenMPC: Extended OpenMP Programming and Tuning for GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[64] Helmar Burkhart,et al. Run, Stencil, Run! – A Comparison of Modern Parallel Programming Paradigms , 2011 .
[65] J. Zee. Heating the patient: a promising approach? , 2002 .
[66] Martin Gardner,et al. Mathematical games: the fantastic combinations of John Conway's new solitaire game "life" , 1970 .
[67] Alan Edelman,et al. Language and compiler support for auto-tuning variable-accuracy algorithms , 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[68] Samuel Williams,et al. Auto-Tuning Stencil Computations on Multicore and Accelerators , 2010, Scientific Computing with Multicore and Accelerators.
[69] M. J. D. Powell,et al. An efficient method for finding the minimum of a function of several variables without calculating derivatives , 1964, Comput. J..
[70] Antoine Petitet,et al. Minimizing development and maintenance costs in supporting persistently optimized BLAS , 2005 .
[71] Dhabaleswar K. Panda,et al. Scalable Earthquake Simulation on Petascale Supercomputers , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[72] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[73] Laxmikant V. Kalé,et al. Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[74] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[75] Helmar Burkhart,et al. Implementing the ALWAN Communication and Data Distribution Library Using PVM , 1996, PVM.
[76] John Shalf,et al. SEJITS: Getting Productivity and Performance With Selective Embedded JIT Specialization , 2010 .
[77] E Neufeld,et al. Novel conformal technique to reduce staircasing artifacts at material boundaries for FDTD modeling of the bioheat equation , 2007, Physics in medicine and biology.
[78] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[79] Robert Frank,et al. ALWAN: A Skeleton Programming Language , 1996, COORDINATION.
[80] Samuel Williams,et al. Auto-tuning performance on multicore computers , 2008 .
[81] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[82] M. Berger,et al. Adaptive mesh refinement for hyperbolic partial differential equations , 1982 .
[83] C. D. Perttunen,et al. Lipschitzian optimization without the Lipschitz constant , 1993 .
[84] Zhiyuan Li,et al. New tiling techniques to improve cache temporal locality , 1999, PLDI '99.
[85] Gerhard Wellein,et al. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.
[86] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[87] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[88] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[89] Helmar Burkhart,et al. PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations , 2011 .
[90] Wilhelm Burger,et al. Digital Image Processing - An Algorithmic Introduction using Java , 2016, Texts in Computer Science.
[91] Monica S. Lam,et al. A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst..
[92] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.
[93] Katherine A. Yelick,et al. Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..
[94] G. Nemhauser,et al. Integer Programming , 2020 .
[95] David G. Wonnacott,et al. Time Skewing for Parallel Computers , 1999, LCPC.
[96] Leslie Greengard,et al. A fast algorithm for particle simulations , 1987 .
[97] Pen-Chung Yew,et al. Some results on exact data dependence analysis , 1990 .
[98] Volker Strumpen,et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms , 2009, SPAA '06.
[99] Samuel Williams,et al. TORCH Computational Reference Kernels - A Testbed for Computer Science Research , 2010 .
[100] Andrew Lumsdaine,et al. Single-Source Shortest Paths with the Parallel Boost Graph Library , 2006, The Shortest Path Problem.
[101] Bradford L. Chamberlain,et al. The case for high-level parallel programming in ZPL , 1998 .
[102] Kunle Olukotun,et al. Language virtualization for heterogeneous parallel computing , 2010, OOPSLA.
[103] Samuel Williams,et al. Scientific Computing Kernels on the Cell Processor , 2007, International Journal of Parallel Programming.
[104] Utpal Banerjee,et al. Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.
[105] Peter Schlag,et al. Clinical use of the hyperthermia treatment planning system HyperPlan to predict effectiveness and toxicity. , 2003, International journal of radiation oncology, biology, physics.
[106] Bradley C. Kuszmaul,et al. The pochoir stencil compiler , 2011, SPAA '11.
[107] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[108] C. Dimitrakopoulos,et al. 100 GHz Transistors from Wafer Scale Epitaxial Graphene , 2010, 1002.3845.
[109] C. T. Kelley,et al. Modifications of the direct algorithm , 2001 .
[110] S. Weinbaum,et al. A new simplified bioheat equation for the effect of blood flow on local average tissue temperature. , 1985, Journal of biomechanical engineering.
[111] Katherine Yelick,et al. OSKI: A library of automatically tuned sparse matrix kernels , 2005 .
[112] V. Torczon,et al. Direct search methods: then and now , 2000 .
[113] P. Wust,et al. Hyperthermia in combined treatment of cancer. , 2002, The Lancet Oncology.
[114] Zhiyuan Li,et al. Data dependence analysis on multi-dimensional array references , 1989, ICS '89.
[115] Frank Lemke,et al. High-density active optical cable: from a new concept to a prototype , 2011, OPTO.
[116] Christoforos E. Kozyrakis,et al. RAMP: Research Accelerator for Multiple Processors , 2007, IEEE Micro.
[117] William Pugh,et al. The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[118] David F. Bacon,et al. Compiler transformations for high-performance computing , 1994, CSUR.
[119] Gerhard Wellein,et al. Efficient multicore-aware parallelization strategies for iterative stencil computations , 2010, J. Comput. Sci..
[120] Steve Carr,et al. Unroll-and-jam using uniformly generated sets , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[121] William J. Dally,et al. Sequoia: Programming the Memory Hierarchy , 2006, International Conference on Software Composition.
[122] 장훈,et al. [Book review] "Computer Organization and Design, The Hardware/Software Interface" , 1997 .
[123] Toshiyuki Shimizu,et al. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers , 2009, Computer.
[124] Peter Messmer,et al. Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[125] James Demmel,et al. Communication-optimal Parallel and Sequential Cholesky Decomposition , 2009, SIAM J. Sci. Comput..
[126] Arie van Deursen,et al. Domain-specific languages: an annotated bibliography , 2000, SIGP.
[127] Michael Gschwind. Chip multiprocessing and the cell broadband engine , 2006, CF '06.
[128] Ananta Tiwari,et al. Online Adaptive Code Generation and Tuning , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[129] Sanjay V. Rajopadhye,et al. Parameterized tiled loops for free , 2007, PLDI '07.
[130] Kevin Skadron,et al. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations , 2011, International Journal of Parallel Programming.
[131] James Demmel,et al. Communication avoiding Gaussian elimination , 2008, HiPC 2008.
[132] Robert Hooke,et al. "Direct Search" Solution of Numerical and Statistical Problems , 1961, JACM.
[133] Luis A. Dalguer,et al. Staggered-grid split-node method for spontaneous rupture simulation , 2007 .
[134] Rudolf Eigenmann,et al. Cetus: A Source-to-Source Compiler Infrastructure for Multicores , 2009, Computer.
[135] Ron Cytron,et al. Interprocedural dependence analysis and parallelization , 1986, SIGP.
[136] Leonid Oliker,et al. Towards Ultra-High Resolution Models of Climate and Weather , 2008, Int. J. High Perform. Comput. Appl..
[137] Robert Strzodka,et al. Time skewing made simple , 2011, PPoPP '11.
[138] John Randal Allen,et al. Dependence analysis for subscripted variables and its application to program transformations , 1983 .
[139] John D. McCalpin,et al. Time Skewing: A Value-Based Approach to Optimizing for Memory Locality , 1999 .
[140] David A. Bader,et al. Parallel Shortest Path Algorithms for Solving Large-Scale Instances , 2006, The Shortest Path Problem.
[141] John Paul Strachan,et al. The switching location of a bipolar memristor: chemical, thermal and structural mapping , 2011, Nanotechnology.