Reducing communication in parallel graph search algorithms with software caches

In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: a few high-degree vertices connect many low-degree vertices. Despite the randomness in a graph search, it is possible to capitalize on the characteristics of small-world networks by caching relevant information about high-degree vertices. We applied this idea by caching remote vertex ids in a parallel breadth-first search benchmark. Our experiments with different implementations demonstrated significant performance improvements over the reference implementation in several configurations, using 64 to 1024 cores. We also proposed a system design in which resources are dedicated exclusively to caching and shared among a set of nodes. Our evaluation demonstrates that this design reduces communication and has the potential to improve performance on large-scale systems in which communication cost increases significantly with the distance between nodes. In addition, we tested a memcached system as the cache server and found that its generic protocol, which does not match our usage semantics, significantly hinders the potential performance improvements; we therefore suggest that a generic system should also support a basic, lightweight communication protocol to meet the needs of high-performance computing applications. Finally, we explored different configurations to find efficient ways to utilize the resources allocated to solve a given problem size. To this end, we found that utilizing half of the compute cores per allocated node improves performance, and even in this case, the caching variants always outperform the reference implementation.
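The idea of caching remote vertex ids can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the implementation evaluated here: the function name `bfs_levels`, the modulo-based vertex ownership (1-D partitioning), and the per-rank `set` used as the cache are all hypothetical choices made for illustration. Each simulated rank remembers which remote vertices it has already sent to their owners and skips repeat sends, which matters most for high-degree hubs that many frontier vertices point to.

```python
def bfs_levels(adj, num_ranks, root, use_cache):
    """Simulate level-synchronous BFS on a graph split across ranks.

    Assumption: vertex v is owned by rank v % num_ranks (hypothetical).
    Returns (levels, messages), where messages counts the remote
    vertex ids that would be sent over the network.
    """
    def owner(v):
        return v % num_ranks

    levels = {root: 0}
    # One software cache per rank: remote vertex ids already sent.
    caches = [set() for _ in range(num_ranks)]
    frontier = [root]
    messages = 0
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:
            rank = owner(v)
            for u in adj[v]:
                if owner(u) != rank:
                    if use_cache and u in caches[rank]:
                        continue  # owner was already told about u
                    caches[rank].add(u)
                    messages += 1  # send u to its owner
                if u not in levels:
                    levels[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return levels, messages


# A small hub-and-spoke graph: vertex 0 is the high-degree hub.
hub_graph = {0: list(range(1, 8)), **{v: [0] for v in range(1, 8)}}
levels_plain, plain = bfs_levels(hub_graph, 2, 1, use_cache=False)
levels_cached, cached = bfs_levels(hub_graph, 2, 1, use_cache=True)
print(plain, cached)  # 8 5
```

In this example the search result is unchanged by the cache, because a skipped vertex was already sent to its owner at the same or an earlier level. The saving comes entirely from not resending the hub's id from the rank that owns the odd-numbered spokes.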
