Paper Information - GraphR: Accelerating Graph Processing Using ReRAM

GraphR: Accelerating Graph Processing Using ReRAM

Graph processing has recently received intense interest in light of a wide range of needs to understand relationships. It is well known for its poor locality and high memory bandwidth requirement. On conventional architectures, graph workloads incur a significant amount of data movement and energy consumption, which has motivated several hardware graph processing accelerators. Current graph processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if the vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption generally holds for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers a unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity.
Experimental results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves a 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. Compared to a PIM-based architecture, GRAPHR gains a speedup of 1.16× to 4.12× and is 3.67× to 10.96× more energy efficient.
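
The SpMV formulation mentioned in the abstract can be illustrated in software. The following is a minimal sketch, not the paper's implementation: it writes PageRank as repeated sparse matrix-vector multiplication and processes the matrix in small dense tiles, loosely mimicking how a small subgraph block might be handed to a crossbar. The tile size, the damping factor, the function names, and the example graph are all assumptions made for illustration, and ordinary floating-point arithmetic stands in for the analog computation.

```python
# Hypothetical sketch: PageRank as repeated SpMV over small dense tiles.
# TILE stands in for a crossbar dimension; its value is an assumption.
TILE = 2
DAMPING = 0.85  # conventional PageRank damping factor (assumed here)


def build_tiles(edges, num_vertices, tile=TILE):
    """Store the column-stochastic matrix as dense tile x tile blocks.

    Only blocks that contain at least one edge are materialized, so empty
    regions of the sparse matrix cost nothing.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1
    tiles = {}
    for src, dst in edges:
        key = (dst // tile, src // tile)
        block = tiles.setdefault(key, [[0.0] * tile for _ in range(tile)])
        # Matrix entry M[dst][src] = 1 / out_degree(src).
        block[dst % tile][src % tile] += 1.0 / out_degree[src]
    return tiles


def spmv_tiled(tiles, vec, num_vertices, tile=TILE):
    """Compute y = M * vec by accumulating one dense block product at a time."""
    result = [0.0] * num_vertices
    for (row_blk, col_blk), block in tiles.items():
        for r in range(tile):
            row = row_blk * tile + r
            if row >= num_vertices:
                continue  # padding row in a partial block
            acc = 0.0
            for c in range(tile):
                col = col_blk * tile + c
                if col < num_vertices:
                    acc += block[r][c] * vec[col]
            result[row] += acc
    return result


def pagerank(edges, num_vertices, iterations=50):
    """Iterate rank = (1 - d) / n + d * (M * rank) a fixed number of times."""
    tiles = build_tiles(edges, num_vertices)
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        spread = spmv_tiled(tiles, rank, num_vertices)
        rank = [(1.0 - DAMPING) / num_vertices + DAMPING * s for s in spread]
    return rank


# Small example graph as (source, destination) pairs; every vertex has at
# least one outgoing edge, so no dangling-node handling is needed.
edges = [(0, 1), (1, 2), (2, 0), (2, 1), (3, 2)]
ranks = pagerank(edges, 4)
```

Each tile is dense even where the underlying block has only one or two edges, which is the "waste due to sparsity" the abstract refers to; the sketch makes no attempt to model the resulting trade-off, only the data layout.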
