Self-adapting software for numerical linear algebra and LAPACK for clusters

This article describes the context, design, and recent development of the LAPACK for clusters (LFC) project. LFC has been developed within the framework of Self-Adapting Numerical Software (SANS), because we believe such an approach can combine the convenience and ease of use of existing sequential environments with the power and versatility of highly tuned parallel codes that execute on clusters. Accomplishing this is far from trivial, as we argue by presenting pertinent case studies and possible usage scenarios.
