Paper information - Graph processing and machine learning architectures with emerging memory technologies: a survey

This paper surveys domain-specific architectures (DSAs) built on two classes of emerging memory technology. The first class, hybrid memory cube (HMC) and high bandwidth memory (HBM), can reduce data movement between memory and computation by placing computing logic inside memory dies. The second, metal-oxide resistive random access memory (ReRAM), is an emerging non-volatile memory considered a promising candidate for future memory architectures because of its high density, fast read access, and low leakage power. Its key feature is the capability to perform inherently parallel in-situ matrix-vector multiplication in the analog domain. We focus on DSAs for two important applications: graph processing and machine learning acceleration. Based on our understanding of recent architectures and our research experience, we also discuss several potential research directions.
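
The in-situ matrix-vector multiplication mentioned in the abstract can be illustrated with a minimal, idealized sketch. It assumes a crossbar in which each cell stores a conductance, each wordline is driven with an input voltage, and each bitline sums the resulting cell currents. The function name `crossbar_mvm` and all numeric values are hypothetical and chosen only for illustration; real devices add effects such as noise, wire resistance, and limited precision that this sketch ignores.

```python
# Idealized ReRAM crossbar model (illustrative assumption, not a device model).
# Each cell (i, j) stores a conductance G[i][j]. Driving wordline i with
# voltage V[i] produces a cell current V[i] * G[i][j] (Ohm's law), and
# bitline j collects the sum over all rows (Kirchhoff's current law):
#     I[j] = sum_i V[i] * G[i][j]
# In hardware all bitlines accumulate at once; the loops below only
# emulate that parallel analog summation sequentially.

def crossbar_mvm(voltages, conductances):
    """Return the bitline currents for the given wordline voltages."""
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("need one input voltage per wordline (row)")
    currents = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            currents[j] += voltages[i] * conductances[i][j]
    return currents


# Hypothetical 3x2 crossbar: 3 wordlines (inputs), 2 bitlines (outputs).
G = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
V = [1.0, 0.5, 2.0]

I = crossbar_mvm(V, G)
print(I)  # [12.5, 16.0]
```

In this picture the weight matrix stays in the memory array and only the input and output vectors move, which is the sense in which the computation is performed in situ.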

Xuehai Qian
