High Performance Big Data Graph Analytics Leveraging Near Memory System

Big data graph analytics is the future of high performance computing and key to many current and emerging applications. Demand is growing for high performance graph computing on real-world social network graphs. Real-world graph algorithms are memory-intensive: because of low cache locality, a high percentage of their accesses go to the memory subsystem. Near memory, or 3D die-stacked memory, offers low-latency, high-bandwidth communication and therefore has the potential to improve the performance of big data graph analytics. In this paper, we analyse, evaluate and compare the performance of a near memory system for big data graph analytics. Real-world graphs associated with social networks and the web are processed with graph analytics algorithms in a simulated near memory system. We present the performance advantage of near memory combined with a large number of simple in-order processor cores for graph analysis. For the Breadth-First Search algorithm on big data graphs, the proposed system improves performance per Watt by $3.55\times$ to $8.55\times$ over computing systems with fat cores and traditional Double Data Rate (DDR) memory. Across graph analytics algorithms, the proposed near memory computing system delivers an average improvement of $5\times$ in Instructions Per Cycle (IPC) and $7\times$ in performance per Watt.
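To illustrate why Breadth-First Search stresses the memory subsystem, the following is a minimal sketch of a queue-based BFS over a graph stored in compressed sparse row (CSR) form. It is not the implementation evaluated in this work; the function name `bfs_distances`, the CSR arrays `offsets` and `neighbors`, and the small example graph are assumptions made for illustration. The point to note is the lookup `dist[v]`: the neighbour ids of a vertex are effectively arbitrary, so consecutive lookups touch unrelated memory locations, which is the low cache locality described above.

```python
from collections import deque


def bfs_distances(offsets, neighbors, source):
    """Return the hop distance from `source` to every vertex (-1 if unreachable).

    The graph is in CSR form (an assumed layout for this sketch):
    the out-neighbours of vertex u are neighbors[offsets[u]:offsets[u + 1]].
    """
    num_vertices = len(offsets) - 1
    dist = [-1] * num_vertices
    dist[source] = 0
    queue = deque([source])

    while queue:
        u = queue.popleft()
        # The edge list of u is scanned sequentially (cache friendly) ...
        for i in range(offsets[u], offsets[u + 1]):
            v = neighbors[i]
            # ... but dist[v] is an irregular, data-dependent access.
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


if __name__ == "__main__":
    # Hypothetical 6-vertex directed graph:
    # 0 -> 1, 0 -> 4, 1 -> 3, 3 -> 2, 4 -> 2; vertex 5 is isolated.
    offsets = [0, 2, 3, 3, 4, 5, 5]
    neighbors = [1, 4, 3, 2, 2]
    print(bfs_distances(offsets, neighbors, 0))
```

On a graph with millions of vertices, the `dist` array far exceeds cache capacity, so most of these irregular lookups are served by main memory. This is the access pattern that a low-latency, high-bandwidth near memory system is intended to serve.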
