Paper information - Automatic performance tuning of sparse matrix kernels

Automatic performance tuning of sparse matrix kernels

This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster. Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) we search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the SPARSITY system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A^T A·x (or A A^T·x) and A^ρ·x. We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate our performance with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). (Abstract shortened by UMI.)
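The two-step selection described in the abstract can be illustrated with a minimal sketch. It is not SPARSITY's code: the compressed sparse row (CSR) layout, the two candidate kernels, and the names `spmv_csr`, `spmv_csr_rowsum`, and `pick_fastest` are all assumptions made for illustration. The sketch covers only the empirical half of step (2), timing each candidate, and omits the heuristic models.

```python
import time


def spmv_csr(row_ptr, col_idx, values, x, y):
    """Reference kernel: y <- y + A*x, with A stored in CSR (assumed layout)."""
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_csr_rowsum(row_ptr, col_idx, values, x, y):
    """Hypothetical variant: same kernel, inner loop written as a sum over a row slice."""
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] += sum(v * x[j] for v, j in zip(values[lo:hi], col_idx[lo:hi]))
    return y


def pick_fastest(candidates, row_ptr, col_idx, values, x, trials=3):
    """Empirical search: run and time every candidate on the given matrix, keep the fastest."""
    best_name, best_time = None, float("inf")
    for name, kernel in candidates.items():
        elapsed = float("inf")
        for _ in range(trials):
            y = [0.0] * (len(row_ptr) - 1)  # fresh output vector for each timed run
            start = time.perf_counter()
            kernel(row_ptr, col_idx, values, x, y)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time


# Step (1): a (tiny) space of candidate implementations.
CANDIDATES = {"csr_loop": spmv_csr, "csr_rowsum": spmv_csr_rowsum}

# A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
ROW_PTR = [0, 2, 3, 5]
COL_IDX = [0, 2, 1, 0, 2]
VALUES = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [1.0, 2.0, 3.0]

if __name__ == "__main__":
    # Step (2): search the space by running and timing the code.
    name, seconds = pick_fastest(CANDIDATES, ROW_PTR, COL_IDX, VALUES, X)
    print(f"fastest candidate on this matrix and machine: {name} ({seconds:.2e} s)")
```

Which candidate wins depends on the matrix and the machine, which is the reason for searching at all. A real tuner would search over data structures and code variants such as register-blocked formats rather than two pure-Python loops.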

Richard Vuduc | James Demmel
