Scalable Graph Exploration on Multicore Processors

Many important problems in computational sciences, social network analysis, security, and business analytics are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we show that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 million edges, (2) capable of processing 550 million edges per second on an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance on a similar graph on a Cray MTA-2 with 40 processors, and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.
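For readers unfamiliar with the kernel being optimized, the following is a minimal sequential sketch of a level-synchronous BFS. It is not the paper's multithreaded, processor-specific implementation; the function name `bfs_levels`, the CSR arrays `offsets` and `targets`, and the use of a one-bit-per-vertex visited bitmap are all assumptions chosen for illustration.

```python
from typing import List


def bfs_levels(offsets: List[int], targets: List[int], source: int) -> List[int]:
    """Return the hop distance from `source` to every vertex (-1 if unreachable).

    The graph is given in CSR form (an assumed layout): the neighbors of
    vertex u are targets[offsets[u]:offsets[u + 1]].
    """
    n = len(offsets) - 1
    level = [-1] * n

    # Visited set stored as a bitmap, one bit per vertex (illustrative choice).
    visited = bytearray((n + 7) // 8)
    visited[source >> 3] |= 1 << (source & 7)
    level[source] = 0

    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Expand every vertex of the current level before moving to the next.
        for u in frontier:
            for v in targets[offsets[u]:offsets[u + 1]]:
                byte, bit = v >> 3, 1 << (v & 7)
                if not visited[byte] & bit:
                    visited[byte] |= bit
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

As a usage example, the undirected 4-cycle 0-1-3-2-0 plus an isolated vertex 4 can be encoded as `offsets = [0, 2, 4, 6, 8, 8]` and `targets = [1, 2, 0, 3, 0, 3, 1, 2]`; calling `bfs_levels(offsets, targets, 0)` assigns level 0 to the source, level 1 to its two neighbors, level 2 to vertex 3, and -1 to the unreachable vertex. A rate in edges per second, such as the figures quoted above, would be obtained by dividing the number of edges traversed by the wall-clock time of such a search.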
