Automatic performance tuning of sparse matrix kernels

This dissertation presents an automated system for generating highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels such as sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. By contrast, our implementations of SpMV, automatically tuned using a methodology based on empirical search, can achieve up to 31% of peak machine speed and can be up to 4× faster. Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) we search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels such as sparse triangular solve (SpTS) and the computation of A^T A·x (or A A^T·x) and A^ρ·x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., in Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to these bounds and find that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
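
To make the kernel and the two-step selection concrete, the following is a minimal Python sketch and not the SPARSITY implementation. It assumes the compressed sparse row (CSR) format, which the abstract does not name, and it covers only the "running and timing" part of step (2); the heuristic models are omitted. The function names, the two candidate kernels, and the 3×3 example matrix are all hypothetical and chosen for illustration.

```python
import time


def spmv_csr(row_ptr, col_idx, vals, x, y):
    """Candidate 1: y <- y + A*x with A stored in CSR form (assumed format)."""
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_csr_zip(row_ptr, col_idx, vals, x, y):
    """Candidate 2: same arithmetic, different loop structure."""
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] += sum(v * x[j] for v, j in zip(vals[lo:hi], col_idx[lo:hi]))
    return y


# Step (1): a (tiny, hypothetical) space of implementations to choose from.
CANDIDATES = {"csr_loop": spmv_csr, "csr_zip": spmv_csr_zip}


def pick_fastest(candidates, row_ptr, col_idx, vals, x, trials=3):
    """Step (2), timing part only: run each candidate and keep the fastest."""
    n_rows = len(row_ptr) - 1
    best_name, best_time = None, float("inf")
    for name, kernel in candidates.items():
        elapsed = float("inf")
        for _ in range(trials):
            y = [0.0] * n_rows
            t0 = time.perf_counter()
            kernel(row_ptr, col_idx, vals, x, y)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time


# Example matrix A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
ROW_PTR = [0, 2, 3, 5]
COL_IDX = [0, 2, 1, 0, 2]
VALS = [2.0, 1.0, 3.0, 4.0, 5.0]
X = [1.0, 2.0, 3.0]

if __name__ == "__main__":
    # A*x = [5, 6, 19], so starting from y = [1, 1, 1] gives [6, 7, 20].
    print(spmv_csr(ROW_PTR, COL_IDX, VALS, X, [1.0, 1.0, 1.0]))
    print(pick_fastest(CANDIDATES, ROW_PTR, COL_IDX, VALS, X)[0])
```

Which candidate wins depends on the matrix and the machine, which is the reason for searching at all. A real tuner would explore a much larger space (for example, different data structures) and would use models to avoid timing every candidate.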
