Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors
Depei Qian | Zhongzhi Luan | Qingxiao Sun | Yongmin Hu | Yi Liu | Hailong Yang | Mingzhen Li | Xin You | Xiaoyan Liu | Bangduo Chen
[1] Shoaib Kamil,et al. Distributed Halide , 2016, PPoPP.
[2] Wei Ge,et al. The Sunway TaihuLight supercomputer: system and applications , 2016, Science China Information Sciences.
[3] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[4] Sergei Gorlatch,et al. High performance stencil code generation with Lift , 2018, CGO.
[5] Michael E. Wolf,et al. Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.
[6] Guangwen Yang,et al. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[7] Tobias Gysi,et al. STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[9] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[10] Shoaib Kamil,et al. ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[12] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[13] V. Natoli,et al. Exploring New Architectures in Accelerating CFD for Air Force Applications , 2008, 2008 DoD HPCMP Users Group Conference.
[14] David E. Keyes,et al. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates , 2014, SIAM J. Sci. Comput.
[15] Albert Cohen,et al. Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.
[16] Guangwen Yang,et al. Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer , 2020, IEEE Transactions on Parallel and Distributed Systems.
[17] P. Sadayappan,et al. On Optimizing Complex Stencils on GPUs , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[18] Weiguo Liu,et al. 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI.
[20] Alejandro Duran,et al. YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).
[21] Chao Yang,et al. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[22] Hongbin Zheng,et al. Polly – Polyhedral optimization in LLVM , 2012.
[23] Depei Qian,et al. Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster , 2019, SCFA.
[24] Chau-Wen Tseng,et al. Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[25] Shoaib Kamil,et al. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code , 2018, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[26] Chao Yang,et al. 10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[28] Bradford L. Chamberlain,et al. Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators , 2015, ICS.
[29] Gihan R. Mudalige,et al. Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS , 2017, IEEE Transactions on Parallel and Distributed Systems.
[30] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[31] Jia Guo,et al. Writing productive stencil codes with overlapped tiling , 2009.
[32] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[33] Guangwen Yang,et al. Optimizing high-resolution Community Earth System Model on a heterogeneous many-core supercomputing platform , 2020.
[34] Richard Veras,et al. A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.
[35] Mohamed Wahib,et al. AN5D: automated stencil framework for high-degree temporal blocking on GPUs , 2020, CGO.
[36] Allen Taflove,et al. Computational Electrodynamics: The Finite-Difference Time-Domain Method , 1995.
[37] Bradley C. Kuszmaul,et al. The pochoir stencil compiler , 2011, SPAA '11.
[38] Charles Yount,et al. Architecture and Performance of Devito, a System for Automated Stencil Computation , 2018, ACM Trans. Math. Softw.
[39] Andreas Klöckner,et al. Loo.py: transformation-based code generation for GPUs and CPUs , 2014, ARRAY@PLDI.
[40] Guangwen Yang,et al. Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture , 2020, IEEE Transactions on Parallel and Distributed Systems.