Balance Principles for Algorithm-Architecture Co-Design

We consider the problem of "co-design," by which we mean the problem of how to design computational algorithms for particular hardware architectures and vice versa. Our position is that balance principles should drive the co-design process. A balance principle is a theoretical constraint equation that explicitly relates algorithm parameters to hardware parameters according to some figure of merit, such as speed, power, or cost. This notion originates in the work of Kung (1986); Callahan, Cocke, and Kennedy (1988); and McCalpin (1995); however, we reinterpret these classical notions of balance in the modern context of parallel and I/O-efficient algorithm design, as well as trends in emerging architectures. We argue that such a principle lets one better understand algorithm and hardware trends, and furthermore gain insight into how to improve both algorithms and hardware. For example, we suggest that although matrix multiply is currently compute-bound, it will in fact become memory-bound in as few as ten years, even if last-level caches grow at their current rates. Our overall aim is to suggest how to co-design rigorously and quantitatively while still yielding intuition and insight.
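To make the balance-principle idea concrete, here is a minimal sketch (ours, not from the paper; all machine numbers are hypothetical). It compares a machine's balance point, peak flop rate divided by memory bandwidth, against the arithmetic intensity of blocked matrix multiply, which performs roughly sqrt(Z) flops per word moved when the fast memory holds Z words (Hong & Kung, 1981):

```python
import math

def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Machine balance: peak flop rate per byte of memory bandwidth
    (GFLOP/s divided by GB/s gives flops per byte)."""
    return peak_gflops / bandwidth_gbs

def matmul_intensity(cache_bytes: int, word_bytes: int = 8) -> float:
    """Arithmetic intensity of blocked matrix multiply: about sqrt(Z)
    flops per word moved for a fast memory of Z words, i.e.
    sqrt(Z) / word_bytes flops per byte."""
    z_words = cache_bytes / word_bytes
    return math.sqrt(z_words) / word_bytes

# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s bandwidth, 8 MiB last-level cache.
balance = machine_balance(1000.0, 100.0)   # 10 flops/byte
intensity = matmul_intensity(8 * 2**20)    # 128 flops/byte
print("compute-bound" if intensity > balance else "memory-bound")
```

If peak flop rate grows faster than bandwidth and cache capacity, `balance` rises faster than `intensity`; that crossover is the mechanism behind the paper's prediction that matrix multiply eventually becomes memory-bound.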

[1] David A. Patterson, et al. LogP: towards a realistic model of parallel computation, 1993, PPOPP '93.

[2] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.

[3] Guy E. Blelloch, et al. Scheduling threads for constructive cache sharing on CMPs, 2007, SPAA '07.

[4] Gul A. Agha, et al. Analysis of Parallel Algorithms for Energy Conservation in Scalable Multicore Architectures, 2009, International Conference on Parallel Processing.

[5] Samuel Williams, et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.

[6] Edward D. Lazowska, et al. Quantitative System Performance, 1985, Int. CMG Conference.

[7] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities, 1967, AFIPS '67 (Spring).

[8] David Patterson, et al. The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges, 2009.

[9] Ken Kennedy, et al. Estimating Interlock and Improving Balance for Pipelined Architectures, 1988, J. Parallel Distributed Comput.

[10] James Demmel, et al. Minimizing Communication in Linear Algebra, 2009, arXiv.

[11] James Demmel, et al. Modeling the benefits of mixed data and task parallelism, 1995, SPAA '95.

[12] Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.

[13] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.

[14] Guy E. Blelloch, et al. Programming parallel algorithms, 1996, CACM.

[15] Richard W. Vuduc, et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs, 2010, PPoPP '10.

[16] William Gropp, et al. An introductory exascale feasibility study for FFTs and multigrid, 2010, IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[17] H. T. Kung. Memory requirements for balanced computer architectures, 1986, ISCA '86.

[18] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing, 2009.

[19] Leslie G. Valiant. A bridging model for parallel computation, 1990, CACM.

[20] Mark D. Hill, et al. Amdahl's Law in the Multicore Era, 2008, Computer.

[21] Keshav Pingali, et al. An experimental comparison of cache-oblivious and cache-conscious programs, 2007, SPAA '07.

[22] Gabriel H. Loh, et al. 3D-Stacked Memory Architectures for Multi-core Processors, 2008, International Symposium on Computer Architecture.

[23] Yao Zhang, et al. A quantitative performance analysis model for GPU architectures, 2011, IEEE 17th International Symposium on High Performance Computer Architecture (HPCA).


[25] Guy E. Blelloch, et al. Low depth cache-oblivious algorithms, 2010, SPAA '10.

[26] Leslie G. Valiant. A Bridging Model for Multi-core Computing, 2008, ESA.

[27] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.



[30] Constantine Bekas, et al. A new energy aware performance metric, 2010, Computer Science - Research and Development.

[31] James Reinders, et al. Intel® threading building blocks, 2008.

[32] Matteo Frigo, et al. Reducers and other Cilk++ hyperobjects, 2009, SPAA '09.

[33] M. Greenstreet, et al. An Energy Aware Model of Computation, 2008.

[34] Guy E. Blelloch, et al. The data locality of work stealing, 2000, SPAA '00.

[35] Bradley C. Kuszmaul, et al. Cilk: an efficient multithreaded runtime system, 1995, PPOPP '95.

[36] Robert W. Numrich, et al. A metric space for computer programs and the principle of computational least action, 2008, The Journal of Supercomputing.

[37] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.

[38] Alain J. Martin. Towards an energy complexity of computation, 2001, Inf. Process. Lett.

[39] Richard P. Brent. The Parallel Evaluation of General Arithmetic Expressions, 1974, JACM.

[40] Ravi Jain, et al. Towards a model of energy complexity for algorithms [mobile wireless applications], 2005, IEEE Wireless Communications and Networking Conference.

[41] Hsien-Hsin S. Lee, et al. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era, 2008, Computer.

[42] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.

[43] David A. Patterson. Latency lags bandwidth, 2004, CACM.