Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor
David E. Keyes | John A. Gunnels | Jed Brown | Aron J. Ahmadia | Tareq M. Malas
[1] Liu Peng,et al. High-order stencil computations on multicore clusters , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[2] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[3] Kent Wilken,et al. Optimal instruction scheduling using integer programming , 2000, PLDI.
[4] William L. Briggs,et al. A multigrid tutorial , 1987 .
[5] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.
[6] John A. Gunnels,et al. Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[7] Sanjit A. Seshia,et al. Sketching stencils , 2007, PLDI '07.
[8] Gerhard Wellein,et al. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.
[9] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[10] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[11] Helmar Burkhart,et al. Automatic code generation and tuning for stencil kernels on modern shared memory architectures , 2011, Computer Science - Research and Development.
[12] Chung-Ta King,et al. Using integer linear programming for instruction scheduling and register allocation in multi-issue processors , 1997 .
[13] Peter Messmer,et al. Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[14] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..
[15] Samuel Williams,et al. Implicit and explicit optimizations for stencil computations , 2006, MSPC '06.
[16] David E. Keyes,et al. Exaflop/s: The why and the how , 2011 .
[17] Archana Ganapathi,et al. A case for machine learning to optimize multicore performance , 2009 .
[18] Abid M. Malik,et al. Constraint Programming Techniques for Optimal Instruction Scheduling , 2008 .
[19] Chi-Wang Shu,et al. High order finite difference and finite volume WENO schemes and discontinuous Galerkin methods for CFD , 2001 .
[20] Konstantin Makarychev,et al. Indexing genomic sequences on the IBM Blue Gene , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[21] Georg Hager,et al. Introducing a Performance Model for Bandwidth-Limited Loop Kernels , 2009, PPAM.
[22] Wu-chun Feng,et al. The Green500 List: Encouraging Sustainable Supercomputing , 2007, Computer.
[23] Bradley C. Kuszmaul,et al. The Pochoir stencil compiler , 2011, SPAA '11.
[24] Dharmendra S. Modha,et al. The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[25] Leonid Oliker,et al. Impact of modern memory subsystems on cache optimizations for stencil computations , 2005, MSP '05.
[26] Zhiyuan Li,et al. Automatic tiling of iterative stencil loops , 2004 .
[27] IBM Blue Gene Team. Overview of the IBM Blue Gene/P Project , 2008, IBM J. Res. Dev..
[28] Weiqiang Wang,et al. A Multilevel Parallelization Framework for High-Order Stencil Computations , 2009, Euro-Par.
[29] Mauricio Hanzich,et al. 3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors , 2009, Sci. Program..
[30] Ken Kennedy,et al. Estimating Interlock and Improving Balance for Pipelined Architectures , 1988, J. Parallel Distributed Comput..
[31] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[32] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[33] Ken Kennedy,et al. Improving the ratio of memory operations to floating-point operations in loops , 1994, TOPL.
[34] Michael D. McCool,et al. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language , 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[35] Zhiyuan Li,et al. Automatic tiling of iterative stencil loops , 2004, TOPL.
[36] Katherine Yelick,et al. Auto-tuning stencil codes for cache-based multicore platforms , 2009 .
[37] M. Suzuoki,et al. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor , 2006, IEEE Journal of Solid-State Circuits.
[38] Samuel Williams,et al. Lattice Boltzmann simulation optimization on leading multicore platforms , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[39] M. Berger,et al. Adaptive mesh refinement for hyperbolic partial differential equations , 1982 .
[40] Wu-chun Feng,et al. The Green500 List , 2007 .
[41] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[42] Franz Franchetti,et al. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures , 2011, CC.
[43] Thomas R. Gross,et al. Postpass Code Optimization of Pipeline Constraints , 1983, TOPL.
[44] Mauricio Hanzich,et al. 3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors , 2014 .
[45] Samuel Williams,et al. An auto-tuning framework for parallel multicore stencil computations , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[46] IBM Redbooks,et al. IBM System Blue Gene Solution: Blue Gene/P Application Development , 2009 .
[47] Gerhard Wellein,et al. Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters , 2010, Parallel Process. Lett..
[48] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[49] Samuel Williams,et al. Auto-Tuning the 27-point Stencil for Multicore , 2009 .