Techniques for Characterizing the Data Movement Complexity of Computations

The execution cost of a program, in terms of both time and energy, comprises computational cost and data movement cost (e.g., the cost of transferring data between CPU and memory devices, between parallel processors, etc.). Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity alone will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data movement complexity will be increasingly important. In their seminal work, Hong & Kung proposed the red-blue pebble game to model the data movement complexity of algorithms. Using the pebble game abstraction, Hong & Kung proved tight asymptotic lower bounds on the data movement complexity of several algorithms by reformulating the problem as a graph partitioning problem. In this dissertation, we develop a novel, alternative lower bounding technique based on graph min-cuts. Using our technique, we derive tight lower bounds for several algorithms, with upper bounds matching within a constant factor. Further, we develop an automated heuristic for our technique based on dynamic analysis, which enables automatic analysis of arbitrary computations. We provide several use cases for our automated approach. This dissertation also presents a technique, built upon the ideas of Christ et al. [15], to derive asymptotic parametric lower bounds for a sub-class of computations, called affine computations. A static analysis based heuristic to automatically derive parametric lower bounds for the affine parts of computations is also presented. Motivated by the emerging interest in large scale parallel systems with interconnection networks and hierarchical caches with varying bandwidths at different levels, we extend the pebble game model to parallel system architectures to characterize the data movement requirements of large scale parallel computers. 
We provide insights into the architectural bottlenecks that limit the performance of algorithms on these parallel machines. Finally, using data movement complexity analysis in conjunction with the roofline model for performance bounds, we perform an algorithm-architecture codesign exploration across an architectural design space. We model the maximum achievable performance and energy efficiency of different algorithms for a given VLSI technology, considering different architectural parameters.
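To make the red-blue pebble game concrete, the following toy simulator sketches the idea: each vertex of a computation DAG must hold a red (fast-memory) pebble before it can be computed, and with only S red pebbles available, pebbles must be spilled to blue (slow-memory) and reloaded, with each spill or reload counted as one I/O. This is a simplified illustration, not Hong & Kung's exact formalism or the dissertation's technique: the example DAG and the LRU eviction policy are assumptions, and counting every eviction as a store overestimates the true game cost (deleting an unneeded red pebble is free in the actual game).

```python
def pebble_io(dag, order, S):
    """Count I/O for executing `dag` (vertex -> list of predecessors) in
    topological `order` with at most S red pebbles, evicting the
    least-recently-used red pebble on overflow."""
    red = []       # vertices currently holding red pebbles, in LRU order
    blue = set()   # vertices spilled to slow memory
    io = 0

    def make_red(v, is_load):
        nonlocal io
        if v in red:
            red.remove(v)
            red.append(v)          # refresh LRU position
            return
        if len(red) >= S:          # out of red pebbles: spill one
            victim = red.pop(0)
            blue.add(victim)
            io += 1                # store to slow memory
        if is_load:
            io += 1                # load from slow memory
        red.append(v)

    for v in order:
        for p in dag.get(v, []):               # inputs must be red first
            make_red(p, is_load=(p not in red))
        make_red(v, is_load=False)             # compute v into a red pebble
    return io

# Hypothetical 4-vertex DAG: c = f(a, b), d = g(c, a).
dag = {"c": ["a", "b"], "d": ["c", "a"]}
print(pebble_io(dag, ["c", "d"], S=2))   # tight fast memory forces spills
print(pebble_io(dag, ["c", "d"], S=4))   # only the two compulsory input loads
```

With S=2 the simulator charges 6 transfers (two input loads plus spill/reload traffic), while S=4 leaves only the 2 compulsory input loads, illustrating how fast-memory capacity governs data movement for a fixed computation.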
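The codesign exploration builds on the standard roofline bound, which caps attainable performance at the minimum of peak compute throughput and memory bandwidth times operational intensity; a data movement lower bound yields an upper bound on an algorithm's intensity, and hence on its achievable performance. A minimal sketch of the basic model (the machine parameters below are illustrative placeholders, not the dissertation's design space):

```python
def roofline(peak_gflops, bw_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s under the basic roofline model:
    bounded by peak compute and by bandwidth * operational intensity."""
    return min(peak_gflops, bw_gbs * intensity_flops_per_byte)

# Hypothetical machine: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
# The ridge point is at 500 / 50 = 10 flops/byte; below it the
# algorithm is bandwidth-bound, above it compute-bound.
for oi in (0.25, 1.0, 10.0, 100.0):
    print(f"intensity {oi:6.2f} -> {roofline(500.0, 50.0, oi):6.1f} GFLOP/s")
```

Sweeping architectural parameters (peak, bandwidth) against algorithm intensities bounded by data movement analysis is the essence of the codesign exploration described above.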

[1] Mohammad Zubair, et al. A unified model for multicore architectures, 2008, IFMT '08.

[2] James Demmel, et al. Perfect Strong Scaling Using No Additional Energy, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[3] Gianfranco Bilardi, et al. A Characterization of Temporal Locality and Its Portability across Memory Hierarchies, 2001, ICALP.

[4] James Demmel, et al. Communication-optimal Parallel and Sequential QR and LU Factorizations, 2008, SIAM J. Sci. Comput.

[5] Jung Ho Ahn, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures, 2009, 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6] David Parello, et al. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies, 2006, International Journal of Parallel Programming.

[7] Samuel H. Fuller, et al. The Future of Computing Performance: Game Over or Next Level?, 2014.

[8] Sven Verdoolaege, et al. isl: An Integer Set Library for the Polyhedral Model, 2010, ICMS.

[9] Mohammad Zubair, et al. Cache-optimal algorithms for option pricing, 2010, TOMS.

[10] John E. Savage. Extending the Hong-Kung Model to Memory Hierarchies, 1995, COCOON.

[11] M. Hestenes, et al. Methods of conjugate gradients for solving linear systems, 1952.

[12] John Shalf, et al. Exascale Computing Technology Challenges, 2010, VECPAR.

[13] A. I. Barvinok, et al. Computing the Ehrhart polynomial of a convex lattice polytope, 1994, Discret. Comput. Geom.

[14] David A. Patterson, et al. Computer Architecture: A Quantitative Approach, 1969.

[15] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[16] Philippe Clauss, et al. Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization, 2012, 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[17] James Demmel, et al. Minimizing Communication in Numerical Linear Algebra, 2009, SIAM J. Matrix Anal. Appl.

[18] Bowen Alpern, et al. A model for hierarchical memory, 1987, STOC.

[19] Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.

[20] P. Feautrier. Parametric integer programming, 1988.

[21] J. Ramanujam, et al. On characterizing the data movement complexity of computational DAGs for parallel execution, 2014, SPAA.

[22] Leslie G. Valiant. A Bridging Model for Multi-core Computing, 2008, ESA.

[23] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[24] T. Tao, et al. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities, 2005, math/0505691.

[25] Sanjay V. Rajopadhye, et al. The Z-polyhedral model, 2007, PPoPP.

[26] H. Whitney, et al. An inequality related to the isoperimetric inequality, 1949.

[27] J. Ramanujam, et al. On Using the Roofline Model with Lower Bounds on Data Movement, 2015, ACM Trans. Archit. Code Optim.

[28] Franco P. Preparata, et al. Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part II, Lower Bounds, 1999, Theory of Computing Systems.

[29] J. Ramanujam, et al. On Characterizing the Data Access Complexity of Programs, 2014, POPL.

[30] James Demmel, et al. Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations, 2014, SPAA.

[31] Michele Scquizzato, et al. Communication Lower Bounds for Distributed-Memory Computations, 2013, STACS.

[32] James Demmel, et al. Graph expansion and communication costs of fast matrix multiplication, 2011, SPAA '11.

[33] Richard W. Vuduc, et al. A Roofline Model of Energy, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[34] Vipin Kumar, et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, 1998, SIAM J. Sci. Comput.

[35] Franco P. Preparata, et al. Upper bounds to processor-time tradeoffs under bounded-speed message propagation, 1995, SPAA '95.

[36] Stefan Rusu, et al. A 45nm 8-core enterprise Xeon processor, 2009.

[37] Desh Ranjan, et al. Upper and lower I/O bounds for pebbling r-pyramids, 2012, J. Discrete Algorithms.

[38] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.

[39] Andrea Pietracaprina, et al. On the Space and Access Complexity of Computation DAGs, 2000, WG.

[40] Richard W. Vuduc, et al. Balance Principles for Algorithm-Architecture Co-Design, 2011, HotPar.

[41] James Demmel, et al. Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1, 2013, arXiv.

[42] Ragavendar Nagapattinam Ramamurthi. Dynamic Trace-based Analysis of Vectorization Potential of Programs, 2012.

[43] James Demmel, et al. Minimizing Communication in All-Pairs Shortest Paths, 2013, IEEE 27th International Symposium on Parallel and Distributed Processing.

[44] Geppino Pucci, et al. The Potential of On-Chip Multiprocessing for QCD Machines, 2005, HiPC.

[45] Desh Ranjan, et al. Strong I/O Lower Bounds for Binomial and FFT Computation Graphs, 2011, COCOON.

[46] Norman P. Jouppi, et al. CACTI 6.0: A Tool to Model Large Caches, 2009.

[47] Richard W. Vuduc, et al. Algorithmic Time, Energy, and Power on Candidate HPC Compute Building Blocks, 2014, IEEE 28th International Parallel and Distributed Processing Symposium.