High Performance Computing

This paper presents a GPU implementation of an asynchronous iterative algorithm for computing incomplete factorizations. Asynchronous algorithms, with their ability to tolerate memory latency, form an important class of algorithms for modern computer architectures. Our GPU implementation considers several non-traditional techniques that can be important for optimizing the convergence and data locality of asynchronous algorithms. These techniques include controlling the order in which variables are updated by controlling the order of execution of thread blocks, taking advantage of cache reuse between thread blocks, and managing the amount of parallelism to control the convergence of
