Better speedups using simpler parallel programming for graph connectivity and biconnectivity

We demonstrate speedups of 9x to 33x for finding the biconnected components of a graph on the Explicit Multi-Threading (XMT) many-core computing platform, relative to the best serial algorithm, using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two recent NVIDIA GPUs of similar or greater silicon area. Previous studies of parallel biconnectivity algorithms achieved at most a 4x speedup; we could not find biconnectivity code for GPUs to compare against.

Ease of programming: the paper suggests that parallel programming for the XMT platform is considerably simpler than for the SMP and GPU platforms. Unlike the quantitative speedup results, the ease-of-programming comparison is more qualitative. Productivity of parallel programming, in which ease of programming is a major factor, is a central interest of PMAM/PPoPP, and we believe the discussion is on par with the state of the art on this relatively underexplored topic.

The results provide new insights into the synergy among algorithms, the practice of parallel programming, and architecture: (1) no single biconnectivity algorithm is dominant for all inputs; (2) XMT provides good performance for each algorithm and better speedups relative to other platforms; (3) the textbook Tarjan-Vishkin (TV) PRAM algorithm was the only one that provided strong speedups on XMT across all inputs considered; and (4) the TV implementation was a direct implementation of a PRAM algorithm, though nontrivial effort was needed to obtain a PRAM version with lower constant factors. Overall, it appears that previous low speedups on other platforms were not caused by inefficient algorithms or their programming; rather, the difference stems from the better match between the algorithms and the XMT platform.

Given the growing interest in adding architectural support for parallel programming to existing multi-cores, our results suggest the following open question: can such added architectural support catch up, in speedups and ease of programming, with a design originally inspired by parallel algorithms, such as XMT? Finally, this work addresses another related interest of PMAM/PPoPP: new parallel workloads that improve synergy with emerging architectures. One variant of the biconnectivity algorithms demonstrated the potential advantage of enhancing XMT with hardware support for more thread contexts, perhaps through context switching among them; this appears to be the first demonstration of this old Cray MTA concept benefiting XMT.
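To give a concrete flavor of the PRAM-style routines the paper benchmarks, the following is a minimal serial C sketch of the hooking and pointer-jumping idea behind the O(log n) parallel connectivity algorithm of Shiloach and Vishkin [16]. It is our own illustration under simplifying assumptions, not the paper's XMTC code: the hooking rule is simplified (the original uses a more careful rule to guarantee the O(log n) step bound), all identifiers are ours, and each for loop over edges or vertices stands in for one parallel step that XMT would issue as a spawn.

#include <stdio.h>

#define NV 7   /* vertices in the toy example */
#define NE 4   /* edges in the toy example */

/* Edge list of a small undirected graph with three connected
   components: {0,1,2}, {3,4}, {5,6}. */
static const int eu[NE] = {0, 1, 3, 5};
static const int ev[NE] = {1, 2, 4, 6};

static int parent[NV];  /* parent[v]: current candidate representative of v */

int main(void) {
    /* Initially every vertex is its own component. */
    for (int v = 0; v < NV; v++) parent[v] = v;

    /* Alternate hooking and pointer jumping until a fixed point.
       Parent labels only ever decrease, so the loop terminates. */
    int changed = 1;
    while (changed) {
        changed = 0;

        /* Hooking: for each edge, point the larger parent label at the
           smaller one. In the PRAM/XMT setting this whole loop is one
           parallel step; here it runs serially for clarity. */
        for (int e = 0; e < NE; e++) {
            int pu = parent[eu[e]], pw = parent[ev[e]];
            if (pu < pw && pu < parent[pw]) { parent[pw] = pu; changed = 1; }
            else if (pw < pu && pw < parent[pu]) { parent[pu] = pw; changed = 1; }
        }

        /* Pointer jumping: each vertex adopts its grandparent, roughly
           halving the depth of every component tree per round. */
        for (int v = 0; v < NV; v++) {
            int pp = parent[parent[v]];
            if (pp != parent[v]) { parent[v] = pp; changed = 1; }
        }
    }

    for (int v = 0; v < NV; v++)
        printf("vertex %d -> component %d\n", v, parent[v]);
    return 0;
}

At the fixed point, every vertex in a component carries the same minimum-label representative (here: 0, 0, 0, 3, 3, 5, 5). The TV biconnectivity algorithm evaluated in the paper builds on primitives of this kind, ultimately reducing biconnectivity to connectivity on an auxiliary graph derived from a spanning tree.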

[1]  Fuat Keceli, et al. Toolchain for Programming, Simulating and Studying the XMT Many-Core Architecture, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum.

[2]  Charalampos E. Tsourakakis, et al. HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop, 2008.

[3]  John H. Reif, et al. Depth-First Search is Inherently Sequential, 1985, Inf. Process. Lett.

[4]  Guy E. Blelloch, et al. The hidden cost of low bandwidth communication, 1994.

[5]  David A. Bader, et al. An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs), 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[6]  P. J. Narayanan, et al. Some GPU Algorithms for Graph Connected Components and Spanning Tree, 2010, Parallel Process. Lett.

[7]  Michael Gerndt, et al. Analyzing Overheads and Scalability Characteristics of OpenMP Applications, 2006, VECPAR.

[8]  Jie Wu, et al. NSF/IEEE-TCPP curriculum initiative on parallel and distributed computing: core topics for undergraduates, 2011, SIGCSE '11.

[9]  Robert E. Tarjan, et al. An Efficient Parallel Biconnectivity Algorithm, 1985, SIAM J. Comput.

[10]  Uzi Vishkin, et al. FPGA-based prototype of a PRAM-on-chip processor, 2008, CF '08.

[11]  Gang Qu, et al. An area-efficient high-throughput hybrid interconnection network for single-chip parallel processing, 2008, 2008 45th ACM/IEEE Design Automation Conference.

[12]  Sergei Vassilvitskii, et al. A model of computation for MapReduce, 2010, SODA '10.

[13]  Fuat Keceli, et al. Power-Performance Comparison of Single-Task Driven Many-Cores, 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[14]  Uzi Vishkin, et al. An O(n² log n) Parallel MAX-FLOW Algorithm, 1982, J. Algorithms.

[15]  Erik Lindholm, et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture, 2008, IEEE Micro.

[16]  Uzi Vishkin, et al. An O(log n) Parallel Connectivity Algorithm, 1982, J. Algorithms.

[17]  Hugh Garraway, Parallel Computer Architecture: A Hardware/Software Approach, 1999, IEEE Concurrency.

[18]  Denise Marie Eckstein, Parallel graph processing using depth-first search and breadth-first search, 1977.

[19]  Uzi Vishkin, et al. Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 2008.

[20]  Uzi Vishkin, et al. Trade-offs between depth and width in parallel computation, 1983, 24th Annual Symposium on Foundations of Computer Science (FOCS 1983).

[21]  William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.

[22]  George C. Caragea, et al. General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Workloads, 2010.

[23]  Gang Qu, et al. Layout-Accurate Design and Implementation of a High-Throughput Interconnection Network for Single-Chip Parallel Processing, 2007.

[24]  David A. Bader, et al. On the architectural requirements for efficient execution of graph algorithms, 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[25]  George C. Caragea, et al. Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform, 2006, Handbook of Parallel Computing.

[26]  George C. Caragea, et al. Brief announcement: better speedups for parallel max-flow, 2011, SPAA '11.

[27]  Uzi Vishkin, et al. Experiments with List Ranking for Explicit Multi-Threaded (XMT) Instruction Parallelism, 1999, Algorithm Engineering.

[28]  Michel Daydé, et al. High Performance Computing for Computational Science - VECPAR 2006, 7th International Conference, Rio de Janeiro, Brazil, June 10-13, 2006, Revised Selected and Invited Papers, 2007, VECPAR.

[29]  P. J. Narayanan, et al. A fast GPU algorithm for graph connectivity, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW).

[30]  Richard Cole, et al. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms, 1986, STOC '86.