Efficient Representation Scheme for Multidimensional Array Operations

Array operations are used in many important scientific codes. Numerous methods for implementing these operations efficiently have been proposed in the literature, most of which focus on two-dimensional arrays; when extended to higher-dimensional arrays, they usually do not perform well. Designing efficient algorithms for multidimensional array operations is therefore an important issue. We propose a new scheme, the extended Karnaugh map representation (EKMR), for representing multidimensional arrays. The main idea of EKMR is to represent a multidimensional array by a set of two-dimensional arrays, which makes designing efficient algorithms for multidimensional array operations less complicated. To evaluate the proposed scheme, we design efficient algorithms for two multidimensional array operations, matrix-matrix addition/subtraction and matrix-matrix multiplication, based on both the EKMR and the traditional matrix representation (TMR) schemes. We conducted theoretical and experimental evaluations of these array operations; in the experimental evaluation, we also compare the intrinsic functions provided by the Fortran 90 compiler with the algorithms based on the EKMR scheme. The results show that the EKMR-based algorithms outperform both the TMR-based algorithms and the Fortran 90 intrinsics.
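The core idea, representing a higher-dimensional array as a two-dimensional one, can be sketched with a minimal index mapping. This is an illustrative sketch only: the exact EKMR index convention is an assumption here (a 4D array A[l][k][i][j], each dimension of size n, stored as an (n·n)×(n·n) 2D array with row l·n+i and column k·n+j), and the function names are hypothetical.

```python
# Hedged sketch of an EKMR(4)-style layout: a 4D array a[l][k][i][j]
# (each dimension of size n) is stored as a single 2D array with
# row index l*n + i and column index k*n + j. The exact convention
# used in the paper may differ; this only illustrates the principle.

def ekmr4_index(l, k, i, j, n):
    """Map 4D indices to 2D indices under the assumed EKMR(4) layout."""
    return l * n + i, k * n + j

def to_ekmr4(a, n):
    """Pack a nested 4D list a[l][k][i][j] into an (n*n) x (n*n) 2D list."""
    m = [[0] * (n * n) for _ in range(n * n)]
    for l in range(n):
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    r, c = ekmr4_index(l, k, i, j, n)
                    m[r][c] = a[l][k][i][j]
    return m

def add_2d(x, y):
    """Element-wise addition on the 2D representation:
    two nested loops instead of four."""
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]
```

Once both operands are in the 2D layout, 4D array addition reduces to plain 2D matrix addition, which is the source of the simpler algorithm design the abstract describes.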
