Analysis of classic algorithms on highly-threaded many-core architectures

Abstract The recently developed Threaded Many-core Memory (TMM) model provides a framework for analyzing algorithms for highly-threaded many-core machines such as GPUs and Cray supercomputers. In particular, it tries to capture the fact that these machines hide memory latencies by using a large number of threads and large memory bandwidth. A TMM analysis has two components: computational complexity and memory complexity. A model is only useful if it can explain and predict empirical data, so in this work we investigate the effectiveness of the TMM model. Under this model, we analyze algorithms for five classic problems (suffix tree/array for string matching, fast Fourier transform, merge sort, list ranking, and all-pairs shortest paths) on a variety of GPUs. We also analyze memory access, matrix multiply, and a sequence alignment algorithm on a set of Cray XMT supercomputers and the latest NVIDIA and AMD GPUs. We compare the results of the analysis with our own experimental findings and those of other researchers who have implemented and measured the performance of these algorithms on a diverse spectrum of GPUs and Cray appliances. We find that the TMM model is able to predict important, non-trivial, and sometimes previously unexplained trends and artifacts in the experimental data.
