Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates
David E. Keyes | Gerhard Wellein | Georg Hager | Hatem Ltaief | Tareq M. Malas | Holger Stengel
[1] Willi Jäger,et al. High Performance Computing in Science and Engineering ’02, 2003.
[2] Guang R. Gao,et al. Locality Optimization of Stencil Applications Using Data Dependency Graphs , 2010, LCPC.
[3] Ulrich Rüde,et al. Challenges and Potentials of Emerging Multicore Architectures, 2009.
[4] Gerhard Wellein,et al. Quantum Transport within a Background Medium: Fluctuations versus Correlations, 2009.
[5] Richard W. Vuduc,et al. A Roofline Model of Energy , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[6] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Gerhard Wellein,et al. Parallel Sparse Matrix-Vector Multiplication as a Test Case for Hybrid MPI+OpenMP Programming , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[8] Ulrich Rüde,et al. Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[9] Gerhard Wellein,et al. Performance Patterns and Hardware Metrics on Modern Multicore Processors: Best Practices for Performance Engineering , 2012, Euro-Par Workshops.
[10] Chun Chen,et al. Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters , 2012, The Journal of Supercomputing.
[11] Gabriel Wittum,et al. Competence in High Performance Computing 2010 , 2012, Springer Berlin Heidelberg.
[12] Rolf Rannacher,et al. Modeling, Simulation and Optimization of Complex Processes: Proceedings of the International Conference on High Performance Scientific Computing, March 10-14, 2003, Hanoi, Vietnam, 2005.
[13] D. Wonnacott,et al. On the Scalability of Loop Tiling Techniques, 2012.
[14] Siegfried Wagner,et al. High Performance Computing in Science and Engineering, Munich 2004, 2008.
[15] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[16] Gerhard Wellein,et al. Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory , 2009, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[17] Gerhard Wellein,et al. Exact Diagonalization Results for Strongly Correlated Electron-Phonon Systems, 2002.
[18] David E. Keyes,et al. Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking , 2014, ArXiv.
[19] Gerhard Wellein,et al. Towards the Limits of present-day Supercomputers: Exact Diagonalization of Strongly Correlated Electron-Phonon Systems, 2000.
[20] Gerhard Wellein,et al. Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips , 2014, WPMVP '14.
[21] Gerhard Wellein,et al. Jacobi-Davidson Algorithm with Fast Matrix-Vector Multiplication on Massively Parallel and Vector Supercomputers, 2001.
[22] Gerhard Wellein,et al. Prospects for truly asynchronous communication with pure MPI and hybrid MPI/OpenMP on current supercomputing platforms, 2011.
[23] Hans-Peter Seidel,et al. Cache Accurate Time Skewing in Iterative Stencil Computations , 2011, 2011 International Conference on Parallel Processing.
[24] Georg Hager,et al. Optimization Techniques for Modern High Performance Computers, 2008.
[25] Albert Cohen,et al. Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.
[26] Katherine Yelick,et al. Auto-tuning stencil codes for cache-based multicore platforms, 2009.
[27] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[28] David G. Wonnacott,et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[29] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[30] Michael M. Resch,et al. High performance computing in science and engineering , 2005, 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'05).
[31] A. R. Bishop,et al. Spatiotemporal evolution of polaronic states in finite quantum systems, 2010, arXiv:1008.1864.
[32] Gerhard Wellein,et al. Introduction to High Performance Computing for Scientists and Engineers , 2010, Chapman and Hall / CRC computational science series.
[33] Willi Jäger,et al. High Performance Computing in Science and Engineering ’99, 2000.
[34] Albert Cohen,et al. The Relation Between Diamond Tiling and Hexagonal Tiling , 2014, Parallel Process. Lett..
[35] Guang R. Gao,et al. Mapping the FDTD Application to Many-Core Chip Architectures , 2009, 2009 International Conference on Parallel Processing.
[36] Leslie Lamport,et al. The parallel execution of DO loops , 1974, CACM.
[37] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[38] Willi Jäger,et al. High Performance Computing in Science and Engineering ’01 , 2002, Springer Berlin Heidelberg.
[39] Gerhard Wellein,et al. Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture, 2003.
[40] Guang R. Gao,et al. Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading , 2014, LCPC.
[41] Gerhard Wellein,et al. Direct Numerical Simulation of Turbulent Flow Over Dimples – Code Optimization for NEC SX-8 plus Flow Results, 2008.
[42] Samuel Williams,et al. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..
[43] Hans-Peter Seidel,et al. Cache oblivious parallelograms in iterative stencil computations , 2010, ICS '10.
[44] Gerhard Wellein,et al. Exact Numerical Treatment of Finite Quantum Systems Using Leading-Edge Supercomputers , 2003, HPSC.
[45] Gerhard Wellein,et al. Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters , 2010, Parallel Process. Lett..
[46] Xing Zhou,et al. Tiling optimizations for stencil computations, 2013.
[47] Gerhard Wellein,et al. Density-Matrix Algorithm for Phonon Hilbert Space Reduction in the Numerical Diagonalization of Quantum Many-Body Systems, 2002.
[48] Uday Bondhugula,et al. Tiling stencil computations to maximize parallelism , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[49] Gerhard Wellein,et al. Exploring performance and power properties of modern multi‐core chips via simple machine models , 2012, Concurr. Comput. Pract. Exp..
[50] Richard Veras,et al. A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.
[51] Gerhard Wellein,et al. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.
[52] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[53] Gerhard Wellein,et al. One-Dimensional Electron-Phonon Systems: Mott- Versus Peierls-Insulators, 2003.
[54] Bradley C. Kuszmaul,et al. The pochoir stencil compiler , 2011, SPAA '11.
[55] Gerhard Wellein,et al. RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks , 2007, ArXiv.
[56] Gerhard Wellein,et al. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model , 2014, ICS.
[57] Gerhard Wellein,et al. Have the Vectors the Continuing Ability to Parry the Attack of the Killer Micros, 2006.
[58] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[59] Gerhard Wellein,et al. DMRG Investigation of Stripe Formation in Doped Hubbard Ladders, 2005.
[60] Gerhard Wellein,et al. Performance of Scientific Applications on Modern Supercomputers, 2005.
[61] Gerhard Wellein,et al. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments , 2010, 2010 39th International Conference on Parallel Processing Workshops.