Exploitation of locality for energy efficiency for breadth first search in fine-grain execution models

In the upcoming exa-scale era, the exploitation of data locality in parallel programs is very important because it benefits both program performance and energy efficiency. However, this is a hard topic for graph algorithms such as the Breadth First Search (BFS) due to the irregular data access patterns. This study analyzes the exploitation of data locality in the BFS and its impact on the energy efficiency with the Codelet fine-grain dataflow-inspired execution model. The Codelet Model more efficiently exploits data locality than the OpenMP-like execution models which traditionally focus on coarse-grain parallelism inside loops. A BFS algorithm is then given to exploit the locality between two loop iterations that belong to two different loops (inter-loop locality). This kind of locality can be exploited by the Codelet Model but not by traditional coarse-grain execution models like OpenMP. Tests were performed on fsim which is a simulation platform developed by Intel for the Ubiquitous High Performance Computing (UHPC) project to design future exa-scale architectures. The results show that this BFS algorithm saves up to 7% of the dynamic energy for memory accesses compared to a BFS implementation based on OpenMP loop scheduling.

[1]  Kamesh Madduri,et al.  Parallel breadth-first search on distributed memory systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  Pradeep Dubey,et al.  Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[3]  Guang R. Gao,et al.  TiNy threads: a thread virtual machine for the Cyclops64 cellular architecture , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[4]  David A. Bader,et al.  Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[5]  Koji Ueno,et al.  Highly scalable graph search for the Graph500 benchmark , 2012, HPDC '12.

[6]  Ming C. Lin,et al.  Real-time Path Planning for Virtual Agents in Dynamic Environments , 2007, 2007 IEEE Virtual Reality Conference.

[7]  Edmond Chow,et al.  A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[8]  李丽,et al.  《Tsinghua Science and Technology》网上国际审稿 , 2002 .

[9]  Krishna P. Gummadi,et al.  Measurement and analysis of online social networks , 2007, IMC '07.

[10]  Robert G. Gallager,et al.  A new distributed algorithm to find breadth first search trees , 1987, IEEE Trans. Inf. Theory.

[11]  L. Segal John , 2013, The Messianic Secret.

[12]  Rob C. Knauerhase,et al.  For extreme parallelism, your OS is Sooooo last-millennium , 2012, HotPar'12.

[13]  Panos M. Pardalos,et al.  Parallel Processing of Discrete Optimization Problems , 1995 .

[14]  Fabrizio Petrini,et al.  Efficient Breadth-First Search on the Cell/BE Processor , 2008, IEEE Transactions on Parallel and Distributed Systems.

[15]  David A. Bader,et al.  Scalable Graph Exploration on Multicore Processors , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[16]  Guang R. Gao,et al.  Massively parallel breadth first search using a tree-structured memory model , 2012, PMAM '12.

[17]  Josep Torrellas Architectures for Extreme-Scale Computing , 2009, Computer.

[18]  Pradeep Dubey,et al.  Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.