Quantitative performance modeling of scientific computations and creating locality in numerical algorithms

How do you determine the running time of a program without actually running it? How do you design an efficient out-of-core iterative algorithm? These are the two questions answered in this thesis. The first part of the thesis demonstrates that the performance of programs can be predicted accurately, automatically, and rapidly using a method called benchmapping. The key aspects of benchmapping are: automatic creation of detailed performance models, prediction of the performance of runtime system calls using these models, and automatic decomposition of a data-parallel program into a sequence of runtime system calls. The feasibility and utility of benchmapping are established using two performance-prediction systems called PERFSIM and BENCHCVL. Empirical studies show that PERFSIM's relative prediction errors are within 21% and that BENCHCVL's relative prediction errors are almost always within 33%. The second part of the thesis presents methods for creating locality in numerical algorithms. Designers of computers, compilers, and runtime systems strive to create designs that exploit the temporal locality of reference found in some programs. Unfortunately, many iterative numerical algorithms lack temporal locality. Executions of such algorithms on current high-performance computers are characterized by saturation of some communication channel (such as a bus or an I/O channel) while the CPU is idle most of the time. The thesis demonstrates that a new method for creating locality, called the blocking covers method, can improve the performance of iterative algorithms including multigrid, conjugate gradient, and implicit time stepping. The thesis proves that the method reduces the number of input-output operations in these algorithms and demonstrates that the method reduces the solution time on workstations by up to a factor of 5. The thesis also describes a parallel linear equation solver based on a method called local densification. The method increases the number of dependencies that can be handled by individual processors but not the number of dependencies that generate interprocessor communication. An implementation of the resulting algorithm is up to 2.5 times faster than conventional algorithms. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
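The three aspects of benchmapping listed in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the thesis's implementation: the linear cost model, the ordinary least-squares fit, the call names `vector_add` and `reduce_sum`, and the benchmark timings are all hypothetical. The sketch fits one model per runtime system call from benchmark samples, then predicts a program's running time by summing the predicted times of the calls it decomposes into.

```python
from dataclasses import dataclass


@dataclass
class LinearCostModel:
    """Hypothetical cost model: seconds = overhead + per_element * n."""
    overhead: float
    per_element: float

    def predict(self, n: int) -> float:
        return self.overhead + self.per_element * n


def fit_linear_model(samples):
    """Fit overhead and per-element cost by ordinary least squares.

    samples: list of (n, measured_seconds) pairs from benchmark runs.
    """
    count = len(samples)
    mean_n = sum(n for n, _ in samples) / count
    mean_t = sum(t for _, t in samples) / count
    sxx = sum((n - mean_n) ** 2 for n, _ in samples)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in samples)
    slope = sxy / sxx
    return LinearCostModel(overhead=mean_t - slope * mean_n, per_element=slope)


def predict_program_time(models, call_trace):
    """Sum the predicted times of a sequence of (call_name, n) pairs."""
    return sum(models[name].predict(n) for name, n in call_trace)


# Made-up benchmark timings for two hypothetical runtime system calls.
models = {
    "vector_add": fit_linear_model([(1000, 0.0011), (2000, 0.0021), (4000, 0.0041)]),
    "reduce_sum": fit_linear_model([(1000, 0.0006), (2000, 0.0011), (4000, 0.0021)]),
}

# A data-parallel program decomposed into a sequence of runtime system calls.
trace = [("vector_add", 3000), ("vector_add", 3000), ("reduce_sum", 3000)]
predicted = predict_program_time(models, trace)
print(f"predicted running time: {predicted:.4f} s")
```

The prediction needs only the call sequence and the fitted models, so the program itself never runs. Real models would likely need more terms than a single linear one, for example to capture communication or cache effects.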
