论文信息 - An Optimized Sparse Approximate Matrix Multiply for Matrices with Decay

An Optimized Sparse Approximate Matrix Multiply for Matrices with Decay

We present an optimized single-precision implementation of the Sparse Approximate Matrix Multiply (\SpAMM{}) [M. Challacombe and N. Bock, arXiv {\bf 1011.3534} (2010)], a fast algorithm for matrix-matrix multiplication for matrices with decay that achieves an $\mathcal{O} (n \log n)$ computational complexity with respect to matrix dimension $n$. We find that the max norm of the error achieved with a \SpAMM{} tolerance below $2 \times 10^{-8}$ is lower than that of the single-precision {\tt SGEMM} for dense quantum chemical matrices, while outperforming {\tt SGEMM} with a cross-over already for small matrices ($n \sim 1000$). Relative to naive implementations of \SpAMM{} using Intel's Math Kernel Library ({\tt MKL}) or AMD's Core Math Library ({\tt ACML}), our optimized version is found to be significantly faster. Detailed performance comparisons are made for quantum chemical matrices with differently structured sub-blocks. Finally, we discuss the potential of improved hardware prefetch to yield 2--3x speedups.

Matt Challacombe | Nicolas Bock | N. Bock | M. Challacombe

[1] M. Head‐Gordon,et al. A multipole acceptability criterion for electronic structure theory , 1998 .

[2] Michael J. Frisch,et al. A linear scaling method for Hartree–Fock exchange calculations of large molecules , 1996 .

[3] D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück , 1935 .

[4] P. Colella,et al. Local adaptive mesh refinement for shock hydrodynamics , 1989 .

[5] John Strain,et al. The Fast Gauss Transform with Variable Scales , 1991, SIAM J. Sci. Comput..

[6] G. Peano. Sur une courbe, qui remplit toute une aire plane , 1890 .

[7] Hanan Samet,et al. Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[8] Anastasia Ailamaki,et al. Improving hash join performance through prefetching , 2004, Proceedings. 20th International Conference on Data Engineering.

[9] Zhimin Gu,et al. Performance comparison of data prefetching for pointer-chasing applications , 2009, 2009 First International Conference on Information Science and Engineering.

[10] Per Stenström,et al. A prefetching technique for irregular accesses to linked data structures , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[11] Irene Gargantini,et al. An effective way to represent quadtrees , 1982, CACM.

[12] Vlastimil Havran,et al. Hierarchical visibility culling with occlusion trees , 1998, Proceedings. Computer Graphics International (Cat. No.98EX149).

[13] Eric Schwegler,et al. Fast assembly of the Coulomb matrix: A quantum chemical tree code , 1996 .

[14] Gustavo E. Scuseria,et al. Linear scaling conjugate gradient density matrix search as an alternative to diagonalization for first principles electronic structure calculations , 1997 .

[15] James Demmel,et al. Stability of block algorithms with fast level-3 BLAS , 1992, TOMS.

[16] L. Greengard,et al. Regular Article: A Fast Adaptive Multipole Algorithm in Three Dimensions , 1999 .

[17] Jeremy D. Frens,et al. Language support for Morton-order matrices , 2001, PPoPP '01.

[18] Margaret H. Dunham,et al. Join processing in relational databases , 1992, CSUR.

[19] Michele Benzi,et al. Decay Properties of Spectral Projectors with Applications to Electronic Structure , 2012, SIAM Rev..

[20] Sriram Krishnamoorthy,et al. Data-driven fault tolerance for work stealing computations , 2012, ICS '12.

[21] Michael S. Warren,et al. A Parallel, Portable and Versatile Treecode , 1995, PPSC.

[22] David J. DeWitt,et al. Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines , 1990, VLDB.

[23] Hanan Samet,et al. A Fast Similarity Join Algorithm Using Graphics Processing Units , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[24] Bruno Arnaldi,et al. New trends in collision detection performance , 2009 .

[25] Leslie Greengard,et al. A fast algorithm for particle simulations , 1987 .

[26] Giulia Galli,et al. Linear scaling methods for electronic structure calculations and quantum molecular dynamics simulations , 1996 .

[27] Yang,et al. Direct calculation of electron density in density-functional theory. , 1991, Physical review letters.

[28] Dario Bini,et al. Stability of fast algorithms for matrix multiplication , 1980 .

[29] Laxmikant V. Kalé,et al. Work stealing and persistence-based load balancers for iterative overdecomposed applications , 2012, HPDC '12.

[30] Karen D. Devinea,et al. New Challenges in Dynamic Load Balancing , 2004 .

[31] Paul H. J. Kelly,et al. Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety , 2003, LCPC.

[32] Robert A. van de Geijn,et al. High-performance implementation of the level-3 BLAS , 2008, TOMS.

[33] John R. Gilbert,et al. Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments , 2011, SIAM J. Sci. Comput..

[34] A. V. Duin,et al. A Divide-and-Conquer/Cellular-Decomposition Framework for Million-to-Billion Atom Simulations of Chemical Reactions , 2007 .

[35] George Roussos,et al. A New Error Estimate of the Fast Gauss Transform , 2002, SIAM J. Sci. Comput..

[36] Lars Karlsson,et al. Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion , 2012, TOMS.

[37] L. Greengard,et al. A new version of the Fast Multipole Method for the Laplace equation in three dimensions , 1997, Acta Numerica.

[38] Michele Benzi,et al. Orderings for Factorized Sparse Approximate Inverse Preconditioners , 1999, SIAM J. Sci. Comput..

[39] George E. Karniadakis,et al. A sharp error estimate for the fast Gauss transform , 2006, J. Comput. Phys..

[40] Michael S. Warren,et al. Astrophysical N-body simulations using hierarchical tree data structures , 1992, Proceedings Supercomputing '92.

[41] A. Szabó,et al. Modern quantum chemistry : introduction to advanced electronic structure theory , 1982 .

[42] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.

[43] David E. Manolopoulos,et al. Canonical purification of the density matrix in electronic-structure theory , 1998 .

[44] Jeremy D. Frens,et al. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.

[45] Richard E. Ladner,et al. Algorithms to Take Advantage of Hardware Prefetching , 2007, ALENEX.

[46] Eric Schwegler,et al. Modern Developments in Hartree-Fock Theory: Fast Methods for Computing the Coulomb Matrix , 1996 .

[47] Keshav Pingali,et al. Is Cache-Oblivious DGEMM Viable? , 2006, PARA.

[48] Srinivas Aluru,et al. Parallel domain decomposition and load balancing using space-filling curves , 1997, Proceedings Fourth International Conference on High-Performance Computing.

[49] Hanan Samet,et al. Iterative spatial join , 2003, TODS.

[50] Chia-Lin Yang,et al. Push vs. pull: data movement for linked data structures , 2000, ICS '00.

[51] Min Zhou,et al. Experiences and lessons learned with a portable interface to hardware performance counters , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[52] Joost VandeVondele,et al. Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase. , 2012, Journal of chemical theory and computation.

[53] R. Mcweeny,et al. The density matrix in self-consistent field theory I. Iterative construction of the density matrix , 1956, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[54] Lynn Elliot Cannon,et al. A cellular computer to implement the kalman filter algorithm , 1969 .

[55] Michael S. Warren,et al. A portable parallel particle program , 1995 .

[56] M. Challacombe. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory , 2000 .

[57] S. Goedecker. Electronic Structure Methods Exhibiting Linear Scaling of the Computational Effort with Respect to the Size of the System , 1998, cond-mat/9806073.

[58] Hanan Samet,et al. Data-Parallel Spatial Join Algorithms , 1994, 1994 International Conference on Parallel Processing Vol. 3.

[59] Virginia Vassilevska Williams,et al. Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[60] Wei-Fen Lin,et al. Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[61] Eric Schwegler,et al. Linear scaling computation of the Fock matrix. II. Rigorous bounds on exchange integrals and incremental Fock build , 1997 .

[62] Donald Yeung,et al. The Efficacy of Software Prefetching and Locality Optimizations on Future Memory Systems , 2004, J. Instr. Level Parallelism.

[63] S. Chandrasekaran,et al. A fast adaptive solver for hierarchically semiseparable representations , 2005 .

[64] Guohua Jin,et al. Using Space-filling Curves for Computation Reordering , 2005 .

[65] V. Strassen. Gaussian elimination is not optimal , 1969 .

[66] Wolfgang Hackbusch,et al. Construction and Arithmetics of H-Matrices , 2003, Computing.

[67] Esteban Hernandez Barragan,et al. Performance analisys on multicore system using PAPI , 2011, 2011 6th Colombian Computing Congress (CCC).

[68] Stamatis Vassiliadis,et al. SAD Prefetching for MPEG4 Using Flux Caches , 2006, SAMOS.

[69] Arieh Iserles. How Large is the Exponential of a Banded Matrix , 1999 .

[70] D. Truhlar. Recent progress in density functional theory , 2014 .

[71] Michael S. Warren,et al. A parallel hashed oct-tree N-body algorithm , 1993, Supercomputing '93. Proceedings.

[72] Michael Bader,et al. Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves , 2007, PPAM.

[73] George Ho,et al. PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[74] Matt Challacombe,et al. Linear scaling computation of the Fock matrix. VI. Data parallel computation of the exchange-correlation matrix , 2003 .

[75] Shivkumar Chandrasekaran,et al. A Fast ULV Decomposition Solver for Hierarchically Semiseparable Representations , 2006, SIAM J. Matrix Anal. Appl..

[76] Ken Kennedy,et al. Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings , 2001, International Journal of Parallel Programming.

[77] John R. Gilbert,et al. On the representation and multiplication of hypersparse matrices , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[78] Leslie Greengard,et al. The Fast Gauss Transform , 1991, SIAM J. Sci. Comput..

[79] Stefan Goedecker,et al. Linear scaling electronic structure methods in chemistry and physics , 2003, Comput. Sci. Eng..

[80] Jon Louis Bentley,et al. Quad trees a data structure for retrieval on composite keys , 1974, Acta Informatica.

[81] Eric Schwegler,et al. Linear scaling computation of the Fock matrix , 1997 .

[82] Greg Stitt,et al. Recursion flattening , 2008, GLSVLSI '08.

[83] Marsha Berger,et al. Three-Dimensional Adaptive Mesh Refinement for Hyperbolic Conservation Laws , 1994, SIAM J. Sci. Comput..

[84] James Demmel,et al. Fast matrix multiplication is stable , 2006, Numerische Mathematik.

[85] Christian Ochsenfeld,et al. Linear and sublinear scaling formation of Hartree-Fock-type exchange matrices , 1998 .

[86] William F. Moss,et al. Decay rates for inverses of band matrices , 1984 .

[87] Tai-Sung Lee,et al. Linear-scaling quantum mechanical calculations of biological molecules: The divide-and-conquer approach , 1998 .

[88] Peiyi Tang,et al. Complete inlining of recursive calls: beyond tail-recursion elimination , 2006, ACM-SE 44.

[89] Sylvain Lefebvre,et al. Perfect spatial hashing , 2006, SIGGRAPH 2006.

[90] Peter H. Beckman. Parallel LU decomposition for sparse matrices using quadtrees on a shared-heap multiprocessor , 1993 .

[91] Ramani Duraiswami,et al. Improved Fast Gauss Transform , 2003 .

[92] Matt Challacombe,et al. Linear scaling computation of the Fock matrix. V. Hierarchical Cubature for numerical integration of the exchange-correlation matrix , 2000 .

[93] Thomas Lewiner,et al. Fast Generation of Pointerless Octree Duals , 2010, Comput. Graph. Forum.

[94] P. Campbell,et al. Dynamic Octree Load Balancing Using Space-Filling Curves ∗ , 2003 .

[95] Hanan Samet,et al. The Design and Analysis of Spatial Data Structures , 1989 .

[96] Matt Challacombe,et al. Fast Multiplication of Matrices with Decay , 2010, ArXiv.

[97] David R. Bowler,et al. Recent progress in linear scaling ab initio electronic structure techniques , 2002 .

[98] V. Rokhlin,et al. Rapid Evaluation of Potential Fields in Three Dimensions , 1988 .

[99] S. Goedecker. Linear scaling electronic structure methods , 1999 .

[100] Don Coppersmith,et al. Matrix multiplication via arithmetic progressions , 1987, STOC.

[101] Raphael Yuster,et al. Fast sparse matrix multiplication , 2004, TALG.

[102] John R. Gilbert,et al. Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication , 2008, 2008 37th International Conference on Parallel Processing.

[103] D. Bowler,et al. O(N) methods in electronic structure calculations. , 2011, Reports on progress in physics. Physical Society.

[104] Gustavo E. Scuseria,et al. Semiempirical methods with conjugate gradient density matrix search to replace diagonalization for molecular systems containing thousands of atoms , 1997 .

[105] R. Mcweeny,et al. The density matrix in self-consistent field theory II. Applications in the molecular orbital theory of conjugated systems , 1956, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[106] Pradeep Dubey,et al. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[107] Matt Challacombe,et al. A simplified density matrix minimization for linear scaling self-consistent field theory , 1999 .

[108] M. Benzi,et al. DECAY BOUNDS AND ( ) ALGORITHMS FOR APPROXIMATING FUNCTIONS OF SPARSE MATRICES , 2007 .

[109] Keshav Pingali,et al. An experimental comparison of cache-oblivious and cache-conscious programs , 2007, SPAA '07.

[110] James Coole,et al. Traversal caches: a first step towards FPGA acceleration of pointer-based data structures , 2008, CODES+ISSS '08.

[111] Sally A. McKee,et al. Can hardware performance counters be trusted? , 2008, 2008 IEEE International Symposium on Workload Characterization.

[112] L. Hernquist. Hierarchical N-body methods , 1987 .

[113] Weitao Yang,et al. A density‐matrix divide‐and‐conquer approach for electronic structure calculations of large molecules , 1995 .

[114] M. Berger,et al. Adaptive mesh refinement for hyperbolic partial differential equations , 1982 .

[115] Anders M. N. Niklasson,et al. Trace resetting density matrix purification in O(N) self-consistent-field theory , 2003 .

[116] David S. Wise. Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free , 2000, Euro-Par.

[117] Matt Challacombe,et al. Density matrix perturbation theory. , 2003, Physical review letters.

[118] David S. Wise,et al. The Opie compiler from row-major source to Morton-ordered matrices , 2004, WMPI '04.

[119] Rasmus Pagh,et al. Faster join-projects and sparse matrix multiplications , 2009, ICDT '09.

[120] James R. Larus,et al. Making Pointer-Based Data Structures Cache Conscious , 2000, Computer.

[122] Siddhartha Chatterjee,et al. The Combinatorics of Cache Misses during Matrix Multiplication , 2001, J. Comput. Syst. Sci..

[123] G. Golub,et al. Bounds for the Entries of Matrix Functions with Applications to Preconditioning , 1999 .

[124] Jehee Lee,et al. Linkless Octree Using Multi‐Level Perfect Hashing , 2009, Comput. Graph. Forum.

[125] David S. Wise,et al. Analyzing block locality in Morton-order and Morton-hybrid matrices , 2007, CARN.

[126] David S. Wise,et al. Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms , 2006, MSPC '06.

[127] David S. Wise,et al. Experiments with Quadtree Representation of Matrices , 1988, ISSAC.

[128] David S. Wise. Representing matrices as quadtrees for parallel processors: extended abstract , 1984, SIGS.

[129] David S. Wise,et al. Costs of Quadtree Representation of Nondense Matrices , 1990, J. Parallel Distributed Comput..

[130] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[131] Kathryn S. McKinley,et al. Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[132] David S. Wise,et al. Representation-transparent matrix algorithms with scalable performance , 2007, ICS '07.

[133] He Yong. Survey on real-time collision detection algorithms , 2008 .