Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance

This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach for making two-sided factorizations resilient. We prove the correctness and numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial parts of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead decreases as the size of the matrix or the size of the process grid increases.
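The checksum idea behind ABFT can be illustrated with a minimal sketch. It is not the paper's implementation: it is serial, handles only the left-side Householder update, and every name in it (`householder`, `with_row_checksum`, `recover_column`) is hypothetical. It assumes a checksum column holding the row sums of A is appended to the matrix; because H [A | Ae] = [HA | (HA)e], the checksum stays valid after the update and a lost column can be rebuilt from the survivors.

```python
import math

def householder(x):
    """Reflector H = I - 2 v v^T / (v^T v) that zeroes x[1:] (assumes x != 0)."""
    n = len(x)
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    v[0] += math.copysign(norm, x[0])
    vtv = sum(t * t for t in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vtv
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_row_checksum(a):
    """Append one checksum column holding the sum of each row."""
    return [row + [sum(row)] for row in a]

def recover_column(aug, lost):
    """Rebuild column `lost` in place from the checksum and surviving columns."""
    n = len(aug[0]) - 1  # index of the checksum column
    for row in aug:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)

A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 1.0],
     [1.0, 2.0, 5.0]]

aug = with_row_checksum(A)
H = householder([row[0] for row in A])

# The left update is applied to the data and the checksum column alike.
aug = matmul(H, aug)
reference = [row[:] for row in aug]

# Simulate the loss of column 1 (for example, a failed process), then recover.
for row in aug:
    row[1] = float("nan")
recover_column(aug, 1)

max_err = max(abs(aug[i][j] - reference[i][j])
              for i in range(3) for j in range(4))
```

The recovered column matches the pre-failure data up to rounding error. A two-sided reduction also applies a right-side update, which mixes columns and therefore needs additional care; this sketch does not attempt to reproduce how the paper handles that or how it combines checksums with diskless checkpoints.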
