Portable and productive high-performance computing
[1] Allen Taflove,et al. Computational Electrodynamics: The Finite-Difference Time-Domain Method , 1995 .
[2] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[3] Gene H. Golub,et al. Cyclic Reduction - History and Applications , 1997 .
[4] David E. Keyes,et al. Optimization of an Electromagnetics Code with Multicore Wavefront Diamond Blocking and Multi-dimensional Intra-Tile Parallelization , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[5] Samuel Williams,et al. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..
[6] Liu Peng,et al. High-order stencil computations on multicore clusters , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[7] James F. Epperson,et al. An Introduction to Numerical Methods and Analysis , 2001 .
[8] Fred G. Gustavson,et al. A New Parallel Algorithm for Tridiagonal Symmetric Positive Definite Systems of Equations , 1996, PARA.
[9] Guy E. Blelloch,et al. GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.
[10] Uday Bondhugula,et al. Tiling and optimizing time-iterated computations over periodic domains , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[11] Guy E. Blelloch,et al. Prefix sums and their applications , 1990 .
[12] Hee-Seok Kim,et al. A scalable, numerically stable, high-performance tridiagonal solver using GPUs , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] Uday Bondhugula,et al. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model , 2008, CC.
[14] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[15] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[16] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[17] Samuel Williams,et al. Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms , 2009, J. Parallel Distributed Comput..
[18] Zhiyuan Li,et al. New tiling techniques to improve cache temporal locality , 1999, PLDI '99.
[19] Rainer Bleck,et al. Salinity-driven Thermocline Transients in a Wind- and Thermohaline-forced Isopycnic Coordinate Model of the North Atlantic , 1992 .
[20] Volker Strumpen,et al. The cache complexity of multithreaded cache oblivious algorithms , 2006, SPAA.
[21] Aleksandar Zlateski. A design and implementation of an efficient, parallel watershed algorithm for affinity graphs , 2011 .
[22] A. Nakano,et al. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers , 1994 .
[23] Wei Shyy,et al. Lattice Boltzmann Method for 3-D Flows with Curved Boundary , 2000 .
[24] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[25] Gilles Bertrand,et al. Watershed Cuts: Minimum Spanning Forests and the Drop of Water Principle , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[26] Leonid Oliker,et al. Impact of modern memory subsystems on cache optimizations for stencil computations , 2005, MSP '05.
[27] Matteo Frigo,et al. A fast Fourier transform compiler , 1999, SIGP.
[28] Charles E. Leiserson,et al. The Cilk++ concurrency platform , 2009, 2009 46th ACM/IEEE Design Automation Conference.
[29] José M. F. Moura,et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms , 2004, Int. J. High Perform. Comput. Appl..
[30] Weiqiang Wang,et al. A Multilevel Parallelization Framework for High-Order Stencil Computations , 2009, Euro-Par.
[31] Samuel Williams,et al. Implicit and explicit optimizations for stencil computations , 2006, MSPC '06.
[32] Peter S. Pacheco. Parallel programming with MPI , 1996 .
[33] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[34] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[35] Harold S. Stone,et al. An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations , 1973, JACM.
[36] W. Donald Frazer,et al. Samplesort: A Sampling Approach to Minimal Storage Tree Sorting , 1970, JACM.
[37] Katherine Yelick,et al. OSKI: A library of automatically tuned sparse matrix kernels , 2005 .
[38] Robert E. Tarjan,et al. Data structures and network algorithms , 1983, CBMS-NSF regional conference series in applied mathematics.
[39] Hans Burkhardt,et al. A Parallel Watershed Algorithm , 1996 .
[40] Guy E. Blelloch,et al. A comparison of sorting algorithms for the connection machine CM-2 , 1991, SPAA '91.
[41] Maryam Mehri Dehnavi,et al. Autotuning divide‐and‐conquer stencil computations , 2017, Concurr. Comput. Pract. Exp..
[42] Douglas Stott Parker,et al. Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.
[43] Weiqiang Wang,et al. In-Core Optimization of High-Order Stencil Computations , 2009, PDPTA.
[44] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[45] Payut Pantawongdecha. Autotuning divide-and-conquer matrix-vector multiplication , 2016 .
[46] Steven G. Johnson,et al. The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.
[47] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[48] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science.
[49] Robert E. Tarjan,et al. Efficiency of a Good But Not Linear Set Union Algorithm , 1972, JACM.
[50] Bradley C. Kuszmaul,et al. The Pochoir stencil compiler , 2011, SPAA '11.
[51] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[52] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[53] I-Hsin Chung,et al. Active Harmony: Towards Automated Performance Tuning , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[54] David E. Keyes,et al. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates , 2014, SIAM J. Sci. Comput..
[55] Shoaib Ashraf Kamil,et al. Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages , 2012 .
[56] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[57] Ralph Johnson,et al. Design Patterns: Elements of Reusable Object-Oriented Software , 2019 .