Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, each of which involves a Cholesky factorization of a symmetric positive-definite covariance matrix, a dense factorization that is demanding in both memory footprint and computation. We propose a novel solution to this problem. At the mathematical level, we reduce the computational requirement by exploiting the data sparsity of the matrix's off-diagonal tiles by means of low-rank approximations. At the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for extreme-scale linear algebra operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets of up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions for environmental applications.
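
The low-rank idea behind the off-diagonal tiles can be illustrated with a minimal sketch. It is not the paper's implementation: the 1D sorted locations, the squared-exponential kernel, the tile size, the tolerance, and the helper names `sq_exp_covariance` and `compress_tile` are all assumptions chosen for brevity, and a truncated SVD stands in for whichever compression scheme the actual solver uses.

```python
import numpy as np

def sq_exp_covariance(points, length_scale=0.1):
    """Squared-exponential covariance (illustrative kernel, assumed here)."""
    d = points[:, None] - points[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def compress_tile(tile, tol=1e-8):
    """Truncated SVD: keep singular values above tol times the largest one."""
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))
    # Fold the singular values into the left factor so that tile ~= U @ V.
    return u[:, :k] * s[:k], vt[:k, :]

rng = np.random.default_rng(0)
n, nb = 512, 128                      # matrix size and tile size (assumed)
points = np.sort(rng.random(n))       # sorted 1D locations (a simplification)
cov = sq_exp_covariance(points)

# Off-diagonal tile coupling the last quarter of the points to the first.
tile = cov[3 * nb:4 * nb, 0:nb]
U, V = compress_tile(tile)
rank = U.shape[1]

rel_err = np.linalg.norm(tile - U @ V) / np.linalg.norm(tile)
dense_entries = nb * nb
low_rank_entries = 2 * nb * rank
print(f"rank={rank}, rel_err={rel_err:.2e}, "
      f"storage {dense_entries} -> {low_rank_entries} entries")
```

When the numerical rank is much smaller than the tile size, storing the two thin factors instead of the dense tile reduces both the memory footprint and the cost of the tile operations in the factorization. Diagonal tiles are typically kept dense.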
