Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds

Abstract

Reducing communication, whether between levels of a memory hierarchy or between processors over a network, is a key component of performance optimization (in both time and energy) for many nested loop problems, including dense linear algebra, particle interactions, and machine learning. Previous tiling-based approaches to these problems have been used to find both lower bounds on the communication required to execute them and optimal rearrangements, or blockings, that attain those lower bounds. However, such general approaches have typically assumed that the problem sizes are large, an assumption that is often not met in practice. In this paper, we provide an efficient way to find communication lower bounds, together with efficiently constructible blockings (tilings) that attain them, for nested loop programs with arbitrary loop bounds that operate on multidimensional arrays in the projective case, where the array indices are subsets of the loop indices. Our approach works on all such problems, regardless of dimensionality, size, memory access patterns, or number of arrays.
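As an illustration of the projective setting (not the paper's algorithm), the sketch below tiles matrix multiplication, a loop nest over (i, j, k) whose arrays C(i, j), A(i, k), and B(k, j) are each indexed by a subset of the loop indices. The function name `tiled_matmul`, the tile-size parameters `b1`, `b2`, `b3`, and the traffic model (each tile loads its blocks of A, B, and C exactly once) are assumptions made here for illustration. Tile extents are clipped to the loop bounds, which is the situation that arises when a bound is smaller than the chosen tile size.

```python
def tiled_matmul(A, B, n1, n2, n3, b1, b2, b3):
    """Compute C = A * B tile by tile, for A of size n1 x n3 and B of size n3 x n2.

    Returns (C, words_moved). words_moved follows a simplified, assumed model:
    every tile loads its block of A, B, and C once and nothing is reused
    across tiles.
    """
    C = [[0] * n2 for _ in range(n1)]
    words_moved = 0
    for i0 in range(0, n1, b1):
        for j0 in range(0, n2, b2):
            for k0 in range(0, n3, b3):
                # Clip each tile to the loop bounds so small bounds are handled.
                i1 = min(i0 + b1, n1)
                j1 = min(j0 + b2, n2)
                k1 = min(k0 + b3, n3)
                ti, tj, tk = i1 - i0, j1 - j0, k1 - k0
                # Block of A is ti x tk, block of B is tk x tj, block of C is ti x tj.
                words_moved += ti * tk + tk * tj + ti * tj
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for k in range(k0, k1):
                            C[i][j] += A[i][k] * B[k][j]
    return C, words_moved


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    for tile in (1, 2):
        C, moved = tiled_matmul(A, B, 2, 2, 2, tile, tile, tile)
        print(f"tile={tile} C={C} words_moved={moved}")
```

Under this simplified model, larger tiles move fewer words for the same computation; choosing tile sizes that respect both the fast-memory capacity and small loop bounds is the kind of question the abstract describes.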
