Sparse computation data dependence simplification for efficient compiler-generated inspectors

This paper presents a combined compile-time and runtime loop-carried dependence analysis of sparse matrix codes and evaluates its performance in the context of wavefront parallelism. Sparse computations incorporate indirect memory accesses such as x[col[j]], whose memory locations cannot be determined until runtime. The key contributions of this paper are two compile-time techniques for significantly reducing the overhead of runtime dependence testing: (1) identifying new equality constraints that result in more efficient runtime inspectors, and (2) identifying subset relations between dependence constraints, so that one dependence test subsumes another, which can then be eliminated. The discovery of new equality constraints is enabled by exploiting domain-specific knowledge about index arrays such as col[j]. These simplifications lead to automatically generated inspectors that make it practical to parallelize such computations. We analyze our simplification methods on a collection of seven sparse computations. The evaluation shows that our methods substantially reduce the complexity of the runtime inspectors. Experimental results on a collection of five large matrices show parallel speedups ranging from 2x to more than 8x on an 8-core CPU.
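To make the setting concrete, below is a minimal C sketch (not the paper's generated code) of the inspector-executor pattern for wavefront parallelization of a sparse lower-triangular solve Lx = b in CSR format, assuming the diagonal entry is stored last in each row. The function and array names (rowptr, col, val, level) are illustrative. The inspector resolves the dependences induced by the indirect access x[col[j]] at runtime by assigning each row a wavefront level; the executor then runs all rows within a level in parallel.

```c
/* Inspector (sketch): level[i] = 1 + the maximum level of any earlier row
 * that row i reads through x[col[j]]. Because L is lower triangular, every
 * off-diagonal col[j] < i, so level[col[j]] is already computed when row i
 * is visited. Returns the number of wavefront levels. */
int inspect_wavefronts(int n, const int *rowptr, const int *col, int *level)
{
    int nlevels = 0;
    for (int i = 0; i < n; i++) {
        int lvl = 0;
        for (int j = rowptr[i]; j < rowptr[i + 1] - 1; j++)  /* skip diagonal */
            if (level[col[j]] + 1 > lvl)
                lvl = level[col[j]] + 1;
        level[i] = lvl;
        if (lvl + 1 > nlevels)
            nlevels = lvl + 1;
    }
    return nlevels;
}

/* Executor (sketch): rows sharing a level carry no dependences among
 * themselves, so each level's rows may run in parallel. A production
 * inspector would bucket rows by level instead of rescanning all rows. */
void solve_by_levels(int n, int nlevels, const int *rowptr, const int *col,
                     const double *val, const double *b, double *x,
                     const int *level)
{
    for (int l = 0; l < nlevels; l++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (level[i] != l)
                continue;
            double s = b[i];
            for (int j = rowptr[i]; j < rowptr[i + 1] - 1; j++)
                s -= val[j] * x[col[j]];        /* runtime-indirect access */
            x[i] = s / val[rowptr[i + 1] - 1];  /* diagonal stored last */
        }
    }
}
```

The compile-time simplifications described above shrink the work the inspector must do; here the triangularity of L plays the role of the domain-specific index-array knowledge that makes a one-pass, dependence-test-free inspector possible.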
