Morton-order Matrices Deserve Compilers' Support. Technical Report 533

A proof of concept is offered for the uniform representation of matrices serially in Morton order (or Z-order), as well as their divide-and-conquer processing as quaternary trees. More generally, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because it eases serious problems of locality and latency while the tree structure helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels of hierarchical memory, independently of a specific runtime environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines compiler support for it. Statistics on matrix multiplication, a critical example, show how the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without low-level tuning. Perhaps because of the early success of column-major representation with strength reduction, quadtree representation has been reinvented and redeveloped in areas far from the center that is Programming Languages. As target architectures move to multiprocessing, super-scalar pipes, and hierarchical memories, compilers must support quadtrees better, so that more programmers invent algorithms that use them to exploit the hardware.

CCS Categories and subject descriptors: E.1 [Data Structures]: Arrays; D.3.2 [Programming Languages]: Language Classifications - concurrent, distributed and parallel languages; applicative (functional) languages; D.4.2 [Operating Systems]: Storage management - storage hierarchies; E.2 [Data Storage Representations]: contiguous representations; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical algorithms and problems - computations on matrices.

General Terms: Design, Performance. Additional
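
The Morton-order layout and quadtree processing described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation. The function names (`morton_index`, `to_morton`, `multiply_add`) are hypothetical, and so is the bit convention (row bits in odd positions, column bits in even positions). The sketch assumes square matrices whose order is a power of two. Under that convention the four quadrants (NW, NE, SW, SE) of any block occupy four contiguous quarters of its storage, which is what lets the recursive multiply work on offsets alone.

```python
def morton_index(i, j):
    """Interleave the bits of row i and column j into one Morton (Z-order) index.

    Assumed convention: column bits go to even positions, row bits to odd ones.
    """
    z = 0
    bit = 0
    while i or j:
        z |= (j & 1) << (2 * bit)
        z |= (i & 1) << (2 * bit + 1)
        i >>= 1
        j >>= 1
        bit += 1
    return z


def to_morton(rows):
    """Copy a square row-major matrix (list of lists) into a flat Morton-order list."""
    n = len(rows)
    flat = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[morton_index(i, j)] = rows[i][j]
    return flat


def multiply_add(a, b, c, a0, b0, c0, size):
    """Accumulate C += A * B on Morton-ordered blocks.

    Each block holds `size` elements (a power of four) starting at offsets
    a0, b0, c0. Quadrant q = 2*r + col starts at offset q * (size // 4).
    """
    if size == 1:
        c[c0] += a[a0] * b[b0]
        return
    quarter = size // 4
    for r in range(2):
        for col in range(2):
            for k in range(2):
                multiply_add(
                    a, b, c,
                    a0 + (2 * r + k) * quarter,
                    b0 + (2 * k + col) * quarter,
                    c0 + (2 * r + col) * quarter,
                    quarter,
                )
```

A caller would convert both operands with `to_morton`, allocate a zero-filled result of the same length, and call `multiply_add(a, b, c, 0, 0, 0, len(a))`. A practical version would stop the recursion at a larger base block rather than at single elements.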
