Layout-oblivious compiler optimization for matrix computations

Most scientific computations apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to demonstrate that complex internal organizations of data structures can severely degrade the effectiveness of compiler optimizations. We then present a data-layout-oblivious optimization methodology where, by isolating an abstract representation of the computations from the complex implementation details of their data, we enable these computations to be analyzed and optimized much more accurately by a variety of state-of-the-art compiler technologies. We evaluated our approach on an Intel 8-core platform using two source-to-source compiler infrastructures, Pluto and EPOD. Our results show that while the efficiency of a computational kernel differs across data layouts, the alternative implementations typically benefit from a common set of optimizations on the operations. Therefore, optimizing the operations and the data layout of a computation separately can dramatically enhance the effectiveness of compiler optimizations compared with the conventional approach of operating on a unified representation.
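To make the core idea concrete, the following minimal C++ sketch is our own illustration, not the paper's Pluto or EPOD infrastructure; the names RowMajor, PackedLower, and trmv_lower are hypothetical. The kernel is written against an abstract index mapping, and the physical layout is bound separately, so the compiler sees a plain affine loop nest that can be analyzed and optimized once and then paired with either a conventional dense layout or a packed triangular one.

```cpp
#include <cstddef>
#include <vector>

// Layout policy 1: conventional dense row-major storage of an n-by-n matrix.
struct RowMajor {
    std::size_t n;
    std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * n + j;
    }
};

// Layout policy 2: packed storage of a lower-triangular matrix,
// keeping only the elements with j <= i.
struct PackedLower {
    std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * (i + 1) / 2 + j;  // row i starts after i*(i+1)/2 elements
    }
};

// Layout-oblivious kernel: y = L * x for lower-triangular L.
// The loop nest never mentions how L is stored; tiling, fusion, or
// parallelization can be applied here independently of the layout choice.
template <typename Layout>
void trmv_lower(const std::vector<double>& L, const std::vector<double>& x,
                std::vector<double>& y, std::size_t n, Layout idx) {
    for (std::size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j <= i; ++j)
            s += L[idx(i, j)] * x[j];
        y[i] = s;
    }
}

int main() {
    const std::size_t n = 3;
    // L = [[1,0,0],[2,3,0],[4,5,6]] stored two ways.
    std::vector<double> dense  = {1, 0, 0, 2, 3, 0, 4, 5, 6};  // n*n buffer
    std::vector<double> packed = {1, 2, 3, 4, 5, 6};           // n*(n+1)/2 buffer
    std::vector<double> x = {1, 1, 1}, y1(n), y2(n);

    trmv_lower(dense,  x, y1, n, RowMajor{n});    // same kernel,
    trmv_lower(packed, x, y2, n, PackedLower{});  // two physical layouts
    // Both calls produce y = {1, 5, 15}.
    return 0;
}
```

In this sketch the only layout-dependent code is the index-mapping policy, which mirrors the separation the methodology advocates: the operations are optimized on the abstract representation, and the layout decision is made, and can be changed, independently.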
