623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores
Chao Yang | Yiqun Liu | Canqun Yang | Fangfang Liu | Xiangke Liao | Xianyi Zhang | Yutong Lu | Yunfei Du | Min Xie
[1] Shreekant S. Thakkar, et al. Synchronization algorithms for shared-memory multiprocessors, 1990, Computer.
[2] Pradeep Dubey, et al. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver, 2014, ISC.
[3] Jack Dongarra, et al. Toward a New Metric for Ranking High Performance Computing Systems, 2013, Sandia Report.
[4] Chao Yang, et al. Optimizing and Scaling HPCG on Tianhe-2: Early Experience, 2014, ICA3PP.
[5] Xing Liu, et al. Efficient sparse matrix-vector multiplication on x86-based many-core processors, 2013, ICS '13.
[6] Gerhard Wellein, et al. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization, 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.
[7] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] Ulrich Rüde, et al. Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions, 2000, Computing.
[9] Chao Yang, et al. Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm, 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
[10] Massimiliano Fatica, et al. A CUDA Implementation of the High Performance Conjugate Gradient Benchmark, 2014, PMBS@SC.
[11] Chao Yang, et al. Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[12] Takeshi Iwashita, et al. Block Red-Black Ordering: A New Ordering Strategy for Parallelization of ICCG Method, 2004, International Journal of Parallel Programming.
[13] Samuel Williams, et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms, 2009, Parallel Comput.
[14] Katherine Yelick, et al. OSKI: A library of automatically tuned sparse matrix kernels, 2005.
[15] Michael L. Scott, et al. Algorithms for scalable synchronization on shared-memory multiprocessors, 1991, TOCS.
[16] Arutyun Avetisyan, et al. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures, 2010, HiPEAC.
[17] Pradeep Dubey, et al. Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] David G. Wonnacott, et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations, 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[19] Samuel Williams, et al. Optimization of geometric multigrid for emerging multi- and manycore processors, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[20] Erik Hagersten, et al. Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors, 2006, ICS '06.
[21] Jérémie Allard, et al. Parallel Dense Gauss-Seidel Algorithm on Many-Core Processors, 2009, 2009 11th IEEE International Conference on High Performance Computing and Communications.