An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture

Graph algorithms are notorious for not getting good speedup on parallel architectures. These algorithms tend to suffer from irregular dependencies and a high synchronization cost that prevent an efficient execution on distributed memory machines. Hence such algorithms are mostly parallelized on shared memory machines. However, current commodity shared memory machines do not typically offer enough parallelism to process these problems. In this paper, we are presenting an early investigation of the scalability of such algorithms on Intel's upcoming Many Integrated Core (Intel MIC) architecture which, when it will be released in 2012, is expected to provide more than 50 physical cores with SMT capability. The Intel MIC architecture can be programmed through many programming models, here we investigate the three most popular of these models namely OpenMP, Cilk Plus and Intel's TBB. We present scalability results of a parallel graph coloring algorithm, three variations of a breadth-first search algorithm and a micro benchmark for irregular computations using these three programming models. Our results on a prototype board show that the multi-threaded architecture of Intel MIC can be effectively used for hiding latencies in irregular applications to achieve almost perfect speedup.

[1]  Edmond Chow,et al.  A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[2]  Alex Pothen,et al.  What Color Is Your Jacobian? Graph Coloring for Computing Derivatives , 2005, SIAM Rev..

[3]  Michael A. Bender,et al.  Online Scheduling of Parallel Programs on Heterogeneous Systems with Applications to Cilk , 2002, SPAA '00.

[4]  Thomas H. Cormen,et al.  Introduction to algorithms [2nd ed.] , 2001 .

[5]  David A. Bader,et al.  SNAP, Small-world Network Analysis and Partitioning: An open-source parallel graph framework for the exploration of large-scale networks , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[6]  Denis Trystram,et al.  A Tighter Analysis of Work Stealing , 2010, ISAAC.

[7]  Ümit V. Çatalyürek,et al.  A framework for scalable greedy coloring on distributed-memory parallel computers , 2008, J. Parallel Distributed Comput..

[8]  J. Culberson Iterated Greedy Graph Coloring and the Difficulty Landscape , 1992 .

[9]  Matteo Frigo,et al.  Reducers and other Cilk++ hyperobjects , 2009, SPAA '09.

[10]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[11]  Darren J. Kerbyson,et al.  A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations , 2005, The Journal of Supercomputing.

[12]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[13]  Paul D. Hovland,et al.  Metrics and models for reordering transformations , 2004, MSP '04.

[14]  Charles E. Leiserson,et al.  A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers) , 2010, SPAA '10.

[15]  David Zuckerman,et al.  Electronic Colloquium on Computational Complexity, Report No. 100 (2005) Linear Degree Extractors and the Inapproximability of MAX CLIQUE and CHROMATIC NUMBER , 2005 .

[16]  U. Brandes A faster algorithm for betweenness centrality , 2001 .

[17]  Assefaw H. Gebremedhin,et al.  Scalable parallel graph coloring algorithms , 2000 .

[18]  Ümit V. Çatalyürek,et al.  Graph coloring algorithms for multi-core and massively multithreaded architectures , 2012, Parallel Comput..

[19]  Kivanc Dincer,et al.  A Comparison of Parallel Graph Coloring Algorithms , 1995 .

[20]  J. J. Moré,et al.  Estimation of sparse jacobian matrices and graph coloring problems , 1983 .

[21]  Jonathan W. Berry,et al.  Challenges in Parallel Graph Processing , 2007, Parallel Process. Lett..