A parallel linear solver for multilevel Toeplitz systems with possibly several right-hand sides

Abstract A Toeplitz matrix has constant diagonals; a multilevel Toeplitz matrix is defined recursively with respect to the levels by replacing the matrix elements with Toeplitz blocks. Multilevel Toeplitz linear systems appear in a wide range of applications in science and engineering. This paper discusses an MPI implementation for solving such a linear system by using the conjugate gradient algorithm. The implementation techniques can be generalized to other iterative Krylov methods besides conjugate gradient. These techniques include the use of an arbitrary dimensional process grid for handling the multilevel Toeplitz structure, a communication-hiding approach for performing matrix–vector multiplications, the incorporation of multilevel circulant preconditioning for accelerating convergence, an efficient orthogonalization manager for detecting linear dependence in block iterations, and an algorithmic rearrangement to eliminate all-reduce synchronizations. The combined use of these techniques leads to a scalable solver for large multilevel Toeplitz systems, possibly with several right-hand sides. We show experimental results on matrices of size up to the order of one billion with nearly perfect scaling by using up to 1024 MPI processes. We also demonstrate an application of the solver in parameter estimation for analyzing large-scale climate data.

[1]  Amir Averbuch,et al.  Timing analysis of a parallel algorithm for Toeplitz matrices on a MIMD parallel machine , 1991, Parallel Comput..

[2]  J. Shalf,et al.  A Hybrid MPI / OpenMP 3 D FFT for Plane Wave First-principles Materials Science Codes , 2012 .

[3]  S. Capizzano Matrix algebra preconditioners for multilevel Toeplitz matrices are not superlinear , 2002 .

[4]  J. Chilès,et al.  Geostatistics: Modeling Spatial Uncertainty , 1999 .

[5]  Roger Woodard,et al.  Interpolation of Spatial Data: Some Theory for Kriging , 1999, Technometrics.

[6]  Torsten Hoefler,et al.  Implementation and performance analysis of non-blocking collective operations for MPI , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[7]  Holger Wendland,et al.  Scattered Data Approximation: Conditionally positive definite functions , 2004 .

[8]  Dmitry Pekurovsky,et al.  P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions , 2012, SIAM J. Sci. Comput..

[9]  Andrew Canning Scalable Parallel 3d FFTs for Electronic Structure Codes , 2008, VECPAR.

[10]  Torsten Hoefler,et al.  Optimizing non-blocking collective operations for infiniband , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[11]  James Demmel,et al.  Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..

[12]  A. NikishinA.,et al.  Variable Block CG Algorithms for Solving Large Sparse Symmetric Positive Definite Linear Systems on Parallel Computers, I , 1995 .

[13]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[14]  James Durbin,et al.  The fitting of time series models , 1960 .

[15]  Shivkumar Chandrasekaran,et al.  A Superfast Algorithm for Toeplitz Systems of Linear Equations , 2007, SIAM J. Matrix Anal. Appl..

[17]  T. Chan An Optimal Circulant Preconditioner for Toeplitz Systems , 1988 .

[18]  M. Anitescu,et al.  STOCHASTIC APPROXIMATION OF SCORE FUNCTIONS FOR GAUSSIAN PROCESSES , 2013, 1312.2687.

[19]  Mark Frederick Hoemmen,et al.  An Overview of Trilinos , 2003 .

[20]  W. Gragg,et al.  Superfast solution of real positive definite toeplitz systems , 1988 .

[21]  J. L. Hock,et al.  An exact recursion for the composite nearest‐neighbor degeneracy for a 2×N lattice space , 1984 .

[22]  Sayantan Sur,et al.  High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT , 2011, Computer Science - Research and Development.

[23]  Kesheng Wu,et al.  A Block Orthogonalization Procedure with Constant Synchronization Requirements , 2000, SIAM J. Sci. Comput..

[24]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[25]  Andy A. Nikishin,et al.  Variable Block CG Algorithms for Solving Large Sparse Symmetric Positive Definite Linear Systems on Parallel Computers, I: General Iterative Scheme , 1995, SIAM J. Matrix Anal. Appl..

[26]  Gene H. Golub,et al.  Matrix computations , 1983 .

[27]  N. Levinson The Wiener (Root Mean Square) Error Criterion in Filter Design and Prediction , 1946 .

[28]  Mark Hoemmen,et al.  A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[29]  R. Chan,et al.  An Introduction to Iterative Toeplitz Solvers , 2007 .

[30]  D. O’Leary The block conjugate gradient algorithm and related methods , 1980 .

[31]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[32]  E. Bareiss Numerical solution of linear equations with Toeplitz and Vector Toeplitz matrices , 1969 .

[33]  Richard P. Brent,et al.  Parallel Algorithms for Toeplitz Systems , 1991 .

[34]  Jack Dongarra,et al.  LAPACK Users' Guide, 3rd ed. , 1999 .

[35]  Michael Stewart,et al.  A Superfast Toeplitz Solver with Improved Numerical Stability , 2003, SIAM J. Matrix Anal. Appl..

[36]  MIHAI ANITESCU,et al.  A Matrix-free Approach for Solving the Parametric Gaussian Process Maximum Likelihood Problem , 2012, SIAM J. Sci. Comput..

[37]  V.Y. Pan,et al.  Concurrent Iterative Algorithm for Toeplitz-like Linear Systems , 1993, IEEE Trans. Parallel Distributed Syst..