An Optimized Sparse Approximate Matrix Multiply for Matrices with Decay

We present an optimized single-precision implementation of the Sparse Approximate Matrix Multiply (\SpAMM{}) [M. Challacombe and N. Bock, arXiv {\bf 1011.3534} (2010)], a fast algorithm for matrix-matrix multiplication for matrices with decay that achieves an $\mathcal{O} (n \log n)$ computational complexity with respect to matrix dimension $n$. We find that the max norm of the error achieved with a \SpAMM{} tolerance below $2 \times 10^{-8}$ is lower than that of the single-precision {\tt SGEMM} for dense quantum chemical matrices, while outperforming {\tt SGEMM} with a cross-over already for small matrices ($n \sim 1000$). Relative to naive implementations of \SpAMM{} using Intel's Math Kernel Library ({\tt MKL}) or AMD's Core Math Library ({\tt ACML}), our optimized version is found to be significantly faster. Detailed performance comparisons are made for quantum chemical matrices with differently structured sub-blocks. Finally, we discuss the potential of improved hardware prefetch to yield 2--3x speedups.

[1]  M. Head‐Gordon,et al.  A multipole acceptability criterion for electronic structure theory , 1998 .

[2]  Michael J. Frisch,et al.  A linear scaling method for Hartree–Fock exchange calculations of large molecules , 1996 .

[3]  D. Hilbert Über die stetige Abbildung einer Linie auf ein Flächenstück , 1935 .

[4]  P. Colella,et al.  Local adaptive mesh refinement for shock hydrodynamics , 1989 .

[5]  John Strain,et al.  The Fast Gauss Transform with Variable Scales , 1991, SIAM J. Sci. Comput..

[6]  G. Peano Sur une courbe, qui remplit toute une aire plane , 1890 .

[7]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[8]  Anastasia Ailamaki,et al.  Improving hash join performance through prefetching , 2004, Proceedings. 20th International Conference on Data Engineering.

[9]  Zhimin Gu,et al.  Performance comparison of data prefetching for pointer-chasing applications , 2009, 2009 First International Conference on Information Science and Engineering.

[10]  Per Stenström,et al.  A prefetching technique for irregular accesses to linked data structures , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[11]  Irene Gargantini,et al.  An effective way to represent quadtrees , 1982, CACM.

[12]  Vlastimil Havran,et al.  Hierarchical visibility culling with occlusion trees , 1998, Proceedings. Computer Graphics International (Cat. No.98EX149).

[13]  Eric Schwegler,et al.  Fast assembly of the Coulomb matrix: A quantum chemical tree code , 1996 .

[14]  Gustavo E. Scuseria,et al.  Linear scaling conjugate gradient density matrix search as an alternative to diagonalization for first principles electronic structure calculations , 1997 .

[15]  James Demmel,et al.  Stability of block algorithms with fast level-3 BLAS , 1992, TOMS.

[16]  L. Greengard,et al.  Regular Article: A Fast Adaptive Multipole Algorithm in Three Dimensions , 1999 .

[17]  Jeremy D. Frens,et al.  Language support for Morton-order matrices , 2001, PPoPP '01.

[18]  Margaret H. Dunham,et al.  Join processing in relational databases , 1992, CSUR.

[19]  Michele Benzi,et al.  Decay Properties of Spectral Projectors with Applications to Electronic Structure , 2012, SIAM Rev..

[20]  Sriram Krishnamoorthy,et al.  Data-driven fault tolerance for work stealing computations , 2012, ICS '12.

[21]  Michael S. Warren,et al.  A Parallel, Portable and Versatile Treecode , 1995, PPSC.

[22]  David J. DeWitt,et al.  Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines , 1990, VLDB.

[23]  Hanan Samet,et al.  A Fast Similarity Join Algorithm Using Graphics Processing Units , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[24]  Bruno Arnaldi,et al.  New trends in collision detection performance , 2009 .

[25]  Leslie Greengard,et al.  A fast algorithm for particle simulations , 1987 .

[26]  Giulia Galli,et al.  Linear scaling methods for electronic structure calculations and quantum molecular dynamics simulations , 1996 .

[27]  Yang,et al.  Direct calculation of electron density in density-functional theory. , 1991, Physical review letters.

[28]  Dario Bini,et al.  Stability of fast algorithms for matrix multiplication , 1980 .

[29]  Laxmikant V. Kalé,et al.  Work stealing and persistence-based load balancers for iterative overdecomposed applications , 2012, HPDC '12.

[30]  Karen D. Devinea,et al.  New Challenges in Dynamic Load Balancing , 2004 .

[31]  Paul H. J. Kelly,et al.  Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety , 2003, LCPC.

[32]  Robert A. van de Geijn,et al.  High-performance implementation of the level-3 BLAS , 2008, TOMS.

[33]  John R. Gilbert,et al.  Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments , 2011, SIAM J. Sci. Comput..

[34]  A. V. Duin,et al.  A Divide-and-Conquer/Cellular-Decomposition Framework for Million-to-Billion Atom Simulations of Chemical Reactions , 2007 .

[35]  George Roussos,et al.  A New Error Estimate of the Fast Gauss Transform , 2002, SIAM J. Sci. Comput..

[36]  Lars Karlsson,et al.  Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion , 2012, TOMS.

[37]  L. Greengard,et al.  A new version of the Fast Multipole Method for the Laplace equation in three dimensions , 1997, Acta Numerica.

[38]  Michele Benzi,et al.  Orderings for Factorized Sparse Approximate Inverse Preconditioners , 1999, SIAM J. Sci. Comput..

[39]  George E. Karniadakis,et al.  A sharp error estimate for the fast Gauss transform , 2006, J. Comput. Phys..

[40]  Michael S. Warren,et al.  Astrophysical N-body simulations using hierarchical tree data structures , 1992, Proceedings Supercomputing '92.

[41]  A. Szabó,et al.  Modern quantum chemistry : introduction to advanced electronic structure theory , 1982 .

[42]  Robert A. van de Geijn,et al.  Anatomy of high-performance matrix multiplication , 2008, TOMS.

[43]  David E. Manolopoulos,et al.  Canonical purification of the density matrix in electronic-structure theory , 1998 .

[44]  Jeremy D. Frens,et al.  Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.

[45]  Richard E. Ladner,et al.  Algorithms to Take Advantage of Hardware Prefetching , 2007, ALENEX.

[46]  Eric Schwegler,et al.  Modern Developments in Hartree-Fock Theory: Fast Methods for Computing the Coulomb Matrix , 1996 .

[47]  Keshav Pingali,et al.  Is Cache-Oblivious DGEMM Viable? , 2006, PARA.

[48]  Srinivas Aluru,et al.  Parallel domain decomposition and load balancing using space-filling curves , 1997, Proceedings Fourth International Conference on High-Performance Computing.

[49]  Hanan Samet,et al.  Iterative spatial join , 2003, TODS.

[50]  Chia-Lin Yang,et al.  Push vs. pull: data movement for linked data structures , 2000, ICS '00.

[51]  Min Zhou,et al.  Experiences and lessons learned with a portable interface to hardware performance counters , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[52]  Joost VandeVondele,et al.  Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase. , 2012, Journal of chemical theory and computation.

[53]  R. Mcweeny,et al.  The density matrix in self-consistent field theory I. Iterative construction of the density matrix , 1956, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[54]  Lynn Elliot Cannon,et al.  A cellular computer to implement the kalman filter algorithm , 1969 .

[55]  Michael S. Warren,et al.  A portable parallel particle program , 1995 .

[56]  M. Challacombe A general parallel sparse-blocked matrix multiply for linear scaling SCF theory , 2000 .

[57]  S. Goedecker Electronic Structure Methods Exhibiting Linear Scaling of the Computational Effort with Respect to the Size of the System , 1998, cond-mat/9806073.

[58]  Hanan Samet,et al.  Data-Parallel Spatial Join Algorithms , 1994, 1994 International Conference on Parallel Processing Vol. 3.

[59]  Virginia Vassilevska Williams,et al.  Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[60]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[61]  Eric Schwegler,et al.  Linear scaling computation of the Fock matrix. II. Rigorous bounds on exchange integrals and incremental Fock build , 1997 .

[62]  Donald Yeung,et al.  The Efficacy of Software Prefetching and Locality Optimizations on Future Memory Systems , 2004, J. Instr. Level Parallelism.

[63]  S. Chandrasekaran,et al.  A fast adaptive solver for hierarchically semiseparable representations , 2005 .

[64]  Guohua Jin,et al.  Using Space-filling Curves for Computation Reordering , 2005 .

[65]  V. Strassen Gaussian elimination is not optimal , 1969 .

[66]  Wolfgang Hackbusch,et al.  Construction and Arithmetics of H-Matrices , 2003, Computing.

[67]  Esteban Hernandez Barragan,et al.  Performance analisys on multicore system using PAPI , 2011, 2011 6th Colombian Computing Congress (CCC).

[68]  Stamatis Vassiliadis,et al.  SAD Prefetching for MPEG4 Using Flux Caches , 2006, SAMOS.

[69]  Arieh Iserles How Large is the Exponential of a Banded Matrix , 1999 .

[70]  D. Truhlar Recent progress in density functional theory , 2014 .

[71]  Michael S. Warren,et al.  A parallel hashed oct-tree N-body algorithm , 1993, Supercomputing '93. Proceedings.

[72]  Michael Bader,et al.  Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves , 2007, PPAM.

[73]  George Ho,et al.  PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[74]  Matt Challacombe,et al.  Linear scaling computation of the Fock matrix. VI. Data parallel computation of the exchange-correlation matrix , 2003 .

[75]  Shivkumar Chandrasekaran,et al.  A Fast ULV Decomposition Solver for Hierarchically Semiseparable Representations , 2006, SIAM J. Matrix Anal. Appl..

[76]  Ken Kennedy,et al.  Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings , 2001, International Journal of Parallel Programming.

[77]  John R. Gilbert,et al.  On the representation and multiplication of hypersparse matrices , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[78]  Leslie Greengard,et al.  The Fast Gauss Transform , 1991, SIAM J. Sci. Comput..

[79]  Stefan Goedecker,et al.  Linear scaling electronic structure methods in chemistry and physics , 2003, Comput. Sci. Eng..

[80]  Jon Louis Bentley,et al.  Quad trees a data structure for retrieval on composite keys , 1974, Acta Informatica.

[81]  Eric Schwegler,et al.  Linear scaling computation of the Fock matrix , 1997 .

[82]  Greg Stitt,et al.  Recursion flattening , 2008, GLSVLSI '08.

[83]  Marsha Berger,et al.  Three-Dimensional Adaptive Mesh Refinement for Hyperbolic Conservation Laws , 1994, SIAM J. Sci. Comput..

[84]  James Demmel,et al.  Fast matrix multiplication is stable , 2006, Numerische Mathematik.

[85]  Christian Ochsenfeld,et al.  Linear and sublinear scaling formation of Hartree-Fock-type exchange matrices , 1998 .

[86]  William F. Moss,et al.  Decay rates for inverses of band matrices , 1984 .

[87]  Tai-Sung Lee,et al.  Linear-scaling quantum mechanical calculations of biological molecules: The divide-and-conquer approach , 1998 .

[88]  Peiyi Tang,et al.  Complete inlining of recursive calls: beyond tail-recursion elimination , 2006, ACM-SE 44.

[89]  Sylvain Lefebvre,et al.  Perfect spatial hashing , 2006, SIGGRAPH 2006.

[90]  Peter H. Beckman Parallel LU decomposition for sparse matrices using quadtrees on a shared-heap multiprocessor , 1993 .

[91]  Ramani Duraiswami,et al.  Improved Fast Gauss Transform , 2003 .

[92]  Matt Challacombe,et al.  Linear scaling computation of the Fock matrix. V. Hierarchical Cubature for numerical integration of the exchange-correlation matrix , 2000 .

[93]  Thomas Lewiner,et al.  Fast Generation of Pointerless Octree Duals , 2010, Comput. Graph. Forum.

[94]  P. Campbell,et al.  Dynamic Octree Load Balancing Using Space-Filling Curves ∗ , 2003 .

[95]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[96]  Matt Challacombe,et al.  Fast Multiplication of Matrices with Decay , 2010, ArXiv.

[97]  David R. Bowler,et al.  Recent progress in linear scaling ab initio electronic structure techniques , 2002 .

[98]  V. Rokhlin,et al.  Rapid Evaluation of Potential Fields in Three Dimensions , 1988 .

[99]  S. Goedecker Linear scaling electronic structure methods , 1999 .

[100]  Don Coppersmith,et al.  Matrix multiplication via arithmetic progressions , 1987, STOC.

[101]  Raphael Yuster,et al.  Fast sparse matrix multiplication , 2004, TALG.

[102]  John R. Gilbert,et al.  Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication , 2008, 2008 37th International Conference on Parallel Processing.

[103]  D. Bowler,et al.  O(N) methods in electronic structure calculations. , 2011, Reports on progress in physics. Physical Society.

[104]  Gustavo E. Scuseria,et al.  Semiempirical methods with conjugate gradient density matrix search to replace diagonalization for molecular systems containing thousands of atoms , 1997 .

[105]  R. Mcweeny,et al.  The density matrix in self-consistent field theory II. Applications in the molecular orbital theory of conjugated systems , 1956, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[106]  Pradeep Dubey,et al.  Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[107]  Matt Challacombe,et al.  A simplified density matrix minimization for linear scaling self-consistent field theory , 1999 .

[108]  M. Benzi,et al.  DECAY BOUNDS AND ( ) ALGORITHMS FOR APPROXIMATING FUNCTIONS OF SPARSE MATRICES , 2007 .

[109]  Keshav Pingali,et al.  An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.

[110]  James Coole,et al.  Traversal caches: a first step towards FPGA acceleration of pointer-based data structures , 2008, CODES+ISSS '08.

[111]  Sally A. McKee,et al.  Can hardware performance counters be trusted? , 2008, 2008 IEEE International Symposium on Workload Characterization.

[112]  L. Hernquist Hierarchical N-body methods , 1987 .

[113]  Weitao Yang,et al.  A density‐matrix divide‐and‐conquer approach for electronic structure calculations of large molecules , 1995 .

[114]  M. Berger,et al.  Adaptive mesh refinement for hyperbolic partial differential equations , 1982 .

[115]  Anders M. N. Niklasson,et al.  Trace resetting density matrix purification in O(N) self-consistent-field theory , 2003 .

[116]  David S. Wise Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.

[117]  Matt Challacombe,et al.  Density matrix perturbation theory. , 2003, Physical review letters.

[118]  David S. Wise,et al.  The Opie compiler from row-major source to Morton-ordered matrices , 2004, WMPI '04.

[119]  Rasmus Pagh,et al.  Faster join-projects and sparse matrix multiplications , 2009, ICDT '09.

[120]  James R. Larus,et al.  Making Pointer-Based Data Structures Cache Conscious , 2000, Computer.

[122]  Siddhartha Chatterjee,et al.  The Combinatorics of Cache Misses during Matrix Multiplication , 2001, J. Comput. Syst. Sci..

[123]  G. Golub,et al.  Bounds for the Entries of Matrix Functions with Applications to Preconditioning , 1999 .

[124]  Jehee Lee,et al.  Linkless Octree Using Multi‐Level Perfect Hashing , 2009, Comput. Graph. Forum.

[125]  David S. Wise,et al.  Analyzing block locality in Morton-order and Morton-hybrid matrices , 2007, CARN.

[126]  David S. Wise,et al.  Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms , 2006, MSPC '06.

[127]  David S. Wise,et al.  Experiments with Quadtree Representation of Matrices , 1988, ISSAC.

[128]  David S. Wise Representing matrices as quadtrees for parallel processors: extended abstract , 1984, SIGS.

[129]  David S. Wise,et al.  Costs of Quadtree Representation of Nondense Matrices , 1990, J. Parallel Distributed Comput..

[130]  Robert A. van de Geijn,et al.  SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[131]  Kathryn S. McKinley,et al.  Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[132]  David S. Wise,et al.  Representation-transparent matrix algorithms with scalable performance , 2007, ICS '07.

[133]  He Yong Survey on real-time collision detection algorithms , 2008 .