Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing

Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing. There has been a push for HPC machines to be rated not just in petaflops but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high-latency data communication and large volumes of transfers across nodes, leading to inefficiency in both performance and energy consumption. In this work, we show that these constraints can be overcome by combining efficient, low-overhead data compression techniques, which reduce transfer volumes, with latency-hiding techniques. Using an optimized single-node graph traversal algorithm [1], our novel cluster optimizations yield over 6.6X performance improvement over state-of-the-art data transfer techniques and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120-core Intel® Endeavor cluster with Intel® Xeon® processors E5-2670, which matches the second-ranked result in the November 2011 Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations incur only a 1.8X overhead in overall performance relative to the optimized single-node implementation and allow near-linear scaling with the number of nodes. On 1024 nodes of Intel® Xeon® processor X5670-based systems (with lower per-node performance), our algorithm attained 195 GigaTEPS on a large multi-terabyte graph, demonstrating its high scalability. Our per-node performance is the highest in the top 10 of the November 2011 Graph500 list.
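To make the idea of low-overhead compression of inter-node transfers concrete, the following is a minimal sketch of one common approach: delta-encoding a sorted list of vertex IDs and packing each gap as a variable-length byte sequence. The abstract does not say which compression scheme the authors use, so this is an illustrative assumption rather than their method; the function names `encode_varint_deltas` and `decode_varint_deltas` are hypothetical.

```python
def encode_varint_deltas(sorted_ids):
    """Delta-encode a sorted list of non-negative vertex IDs, then pack
    each gap as a little-endian base-128 varint (7 payload bits per byte,
    high bit set on every byte except the last of a gap).

    Assumption: this is an illustrative scheme, not the paper's own.
    """
    out = bytearray()
    prev = 0
    for v in sorted_ids:
        gap = v - prev  # non-negative because the input is sorted
        prev = v
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # more bytes follow
            gap >>= 7
        out.append(gap)  # final byte of this gap, high bit clear
    return bytes(out)


def decode_varint_deltas(buf):
    """Inverse of encode_varint_deltas: rebuild the sorted ID list."""
    ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in buf:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            prev += gap  # gap complete: add it to the running ID
            ids.append(prev)
            gap = 0
            shift = 0
    return ids


if __name__ == "__main__":
    # Hypothetical set of vertex IDs to send to another node.
    frontier = [3, 10, 11, 300, 70000]
    packed = encode_varint_deltas(frontier)
    assert decode_varint_deltas(packed) == frontier
    print(len(packed), "bytes packed vs", 8 * len(frontier), "bytes as 64-bit IDs")
```

Because the IDs are sorted before encoding, nearby vertices produce small gaps that fit in a single byte, which is where the reduction in transfer volume would come from under this assumed scheme.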

[1]  Ana Paula Appel, et al. Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations, 2010, SDM.

[2]  Shuang Chen, et al. The entropy of ordered sequences and order statistics, 1990, IEEE Trans. Inf. Theory.

[3]  Anthony Skjellum, et al. A Multithreaded Message Passing Interface (MPI) Architecture: Performance and Program Issues, 2001, J. Parallel Distributed Comput.

[4]  David A. Bader, et al. On the architectural requirements for efficient execution of graph algorithms, 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[5]  Pradeep Dubey, et al. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[6]  Satoshi Matsuoka, et al. Performance characteristics of Graph500 on large-scale distributed environment, 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[7]  D. Patterson, et al. Searching for a Parent Instead of Fighting Over Children: A Fast Breadth-First Search Implementation for Graph 500, 2011.

[8]  Hugh E. Williams, et al. Compressing Integers for Fast File Access, 1999, Comput. J.

[9]  Eduard Ayguadé, et al. Overlapping communication and computation by using a hybrid MPI/SMPSs approach, 2010, ICS '10.

[10]  Kamesh Madduri, et al. Parallel breadth-first search on distributed memory systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11]  Pradeep Dubey, et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs, 2010, SIGMOD Conference.

[12]  Jose Sreeram, et al. UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters, 2011.

[13]  Charles E. Leiserson, et al. Fat-trees: Universal networks for hardware-efficient supercomputing, 1985, IEEE Transactions on Computers.

[14]  Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[15]  David A. Bader, et al. Advanced Shortest Paths Algorithms on a Massively-Multithreaded Architecture, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[16]  Brian W. Barrett, et al. Introducing the Graph 500, 2010.

[17]  Vicki H. Allan, et al. Software pipelining, 1995, CSUR.

[18]  Bo Song, et al. Overlapping Communication and Computation in MPI by Multithreading, 2006, PDPTA.

[19]  Mark Anderson. Better benchmarking for supercomputers, 2011.

[20]  Fabrizio Petrini, et al. Efficient Breadth-First Search on the Cell/BE Processor, 2008, IEEE Transactions on Parallel and Distributed Systems.

[21]  Kunle Olukotun, et al. Accelerating CUDA graph algorithms at maximum warp, 2011, PPoPP '11.

[22]  John Shalf, et al. The International Exascale Software Project roadmap, 2011, Int. J. High Perform. Comput. Appl.

[23]  Amar Phanishayee, et al. FAWN: a fast array of wimpy nodes, 2009, SOSP '09.

[24]  David A. Bader, et al. Scalable Graph Exploration on Multicore Processors, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[25]  Charles E. Leiserson, et al. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers), 2010, SPAA '10.

[26]  Alexander Zeier, et al. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units, 2009, Proc. VLDB Endow.

[27]  Hosung Park, et al. What is Twitter, a social network or a news media?, 2010, WWW '10.

[28]  Krishna P. Gummadi, et al. Measurement and analysis of online social networks, 2007, IMC '07.

[29]  J. Koomey. Worldwide electricity used in data centers, 2008.

[30]  Dan Suciu, et al. A query language for a Web-site management system, 1997, SGMD.

[31]  David A. Bader, et al. Approximating Betweenness Centrality, 2007, WAW.

[32]  Yinglong Xia. Topologically Adaptive Parallel Breadth-First Search on Multicore Processors, 2010.

[33]  Edmond Chow, et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, 2005, ACM/IEEE SC 2005 Conference (SC'05).