An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm
[1] Viktor K. Decyk,et al. Adaptable Particle-in-Cell algorithms for graphical processing units , 2010, Comput. Phys. Commun..
[2] Samuel H. Fuller,et al. The Future of Computing Performance: Game Over or Next Level? , 2014 .
[3] Kurt Keutzer,et al. The Concurrency Challenge , 2008, IEEE Design & Test of Computers.
[4] H. Burau,et al. PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster , 2010, IEEE Transactions on Plasma Science.
[5] Rodney A. Kennedy,et al. Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices , 2007 .
[6] Viktor K. Decyk,et al. A general concurrent algorithm for plasma particle-in-cell simulation codes , 1989 .
[7] William Daughton,et al. Advances in petascale kinetic plasma simulation with VPIC and Roadrunner , 2009 .
[8] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[9] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[10] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[11] K. Bohmer. Defect Correction Methods: Theory and Applications , 1984 .
[12] C. Birdsall,et al. Plasma Physics via Computer Simulation , 2018 .
[13] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[14] J. D. C. Little. A Proof for the Queuing Formula: L = λW , 1961 .
[15] Nail A. Gumerov,et al. Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU , 2008, J. Parallel Distributed Comput..
[16] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.
[17] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[18] Kevin J. Bowers,et al. Accelerating a particle-in-cell simulation using a hybrid counting sort , 2001 .
[19] Samuel Williams,et al. Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[20] Julien Langou,et al. Exploiting Mixed Precision Floating Point Hardware in Scientific Computations , 2006, High Performance Computing Workshop.
[21] Luis Chacón,et al. An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm , 2011, J. Comput. Phys..
[22] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.
[23] Jie Cheng,et al. Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..
[24] L. Shampine. Error estimation and control for ODEs , 2005 .
[25] Samuel Williams,et al. Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[26] Peter W. Markstein,et al. IA-64 and elementary functions - speed and precision , 2000 .
[27] Michael C. Huang,et al. Particle-in-cell simulations with charge-conserving current deposition on graphic processing units , 2010, J. Comput. Phys..
[28] Samuel Williams,et al. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms , 2011, Parallel Comput..
[29] David A. Patterson,et al. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) , 2008 .
[30] Mark J. Harris,et al. Parallel Prefix Sum (Scan) with CUDA , 2011 .
[31] Benjamin Bergen,et al. 0.374 Pflop/s trillion-particle kinetic modeling of laser plasma interaction on roadrunner , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[32] Norbert Luttenberger,et al. A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[33] J. Little. A Proof for the Queuing Formula: L = λW , 1961 .
[34] Wen-mei W. Hwu,et al. Compute Unified Device Architecture Application Suitability , 2009, Computing in Science & Engineering.