Parallel Low-Storage Runge-Kutta Solvers for ODE Systems with Limited Access Distance

We consider the solution of initial value problems (IVPs) for large systems of ordinary differential equations (ODEs) in which memory requirements determine the choice of the integration method. In particular, we discuss space-efficient sequential and parallel implementations of embedded Runge-Kutta (RK) methods. Our focus is on exploiting a special structure that occurs in many ODE systems, referred to as "limited access distance," to improve scalability and memory usage. Such systems arise, for example, from the semi-discretization of partial differential equations (PDEs). The storage space required by classical RK methods is directly proportional to the dimension n of the ODE system and the number of stages s of the method. We propose an implementation strategy based on pipelined processing of the stages of the RK method and show how overlapping vectors reduces the memory usage of this computation scheme to less than three storage registers, without restricting the choice of method coefficients or the potential for efficient stepsize control. We analyze and compare the scalability of different parallel implementation strategies in detailed runtime experiments on several modern parallel architectures.
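To make the notion of limited access distance concrete, the following minimal sketch uses a 1D heat equation semi-discretized with central differences as a stand-in for the PDE-derived systems mentioned above. It assumes that the access distance is the largest index offset |i - j| for which component f_j of the right-hand side reads y_i. The function names, the probing approach, and the choice of test problem are illustrative assumptions and are not taken from the paper.

```python
def heat_rhs_component(j, y, alpha=1.0):
    """Component j of a hypothetical right-hand side f(t, y): 1D heat equation,
    central differences, zero Dirichlet boundaries (illustrative choice)."""
    n = len(y)
    left = y[j - 1] if j > 0 else 0.0
    right = y[j + 1] if j < n - 1 else 0.0
    return alpha * (left - 2.0 * y[j] + right)


def dense_rhs_component(j, y):
    """Counterexample: every component reads the whole vector."""
    return sum(y)


def access_distance(n, rhs_component):
    """Probe which y_i influence f_j by perturbing one entry at a time.

    Returns the largest |i - j| for which a change in y_i alters f_j.
    This brute-force O(n^2) probe is only meant for small n and for
    right-hand sides where a unit perturbation visibly changes the result.
    """
    base = [0.0] * n
    d = 0
    for j in range(n):
        f0 = rhs_component(j, base)
        for i in range(n):
            probe = list(base)
            probe[i] = 1.0
            if rhs_component(j, probe) != f0:
                d = max(d, abs(i - j))
    return d


def stage_storage(n, s):
    """Vector elements needed if all s stage vectors of length n are kept,
    as in a straightforward classical RK implementation."""
    return s * n


if __name__ == "__main__":
    print(access_distance(8, heat_rhs_component))   # 1
    print(access_distance(8, dense_rhs_component))  # 7
    # Seven stages (e.g. a Dormand-Prince 5(4) pair) versus three registers.
    print(stage_storage(1_000_000, 7), 3 * 1_000_000)
```

With an access distance of 1, each component of a stage depends only on its immediate neighbours in the previous stage. This locality is the kind of structure that lets stages be processed in a pipelined fashion rather than one full vector at a time.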
