Streaming Batch Eigenupdates for Hardware Neural Networks

Neural networks based on nanodevices, such as metal-oxide memristors, phase-change memories, and flash memory cells, have generated considerable interest for their increased energy efficiency and density in comparison to graphics processing units (GPUs) and central processing units (CPUs). Although immense acceleration of training can be achieved by exploiting the fact that the time complexity of training does not scale with network size, this advantage is limited by the space complexity of stochastic gradient descent, which grows quadratically. The main objective of this work is to reduce this space complexity by using low-rank approximations of the stochastic gradient descent updates. This low space complexity, combined with streaming methods, allows for significant reductions in memory and compute overhead, opening the door to improvements in the area, time, and energy efficiency of training. We refer to this algorithm, and the architecture that implements it, as the streaming batch eigenupdate (SBE) approach.
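To make the idea concrete, the following is a minimal NumPy sketch of a rank-1, streaming approximation of an accumulated batch gradient. It is illustrative only and not the exact SBE update rule: it assumes synthetic inputs x_i and error signals delta_i, and names such as eta and the number of power iterations are placeholders. The point is that only two vectors (u, v) are stored between examples rather than the full m x n gradient matrix, and the resulting weight update is a single outer product of the kind a crossbar array can apply in place.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, B = 64, 32, 256              # layer output size, input size, batch size
eta = 0.1                          # learning rate (illustrative)

# Synthetic per-example inputs x_i and backpropagated errors delta_i.
X = rng.standard_normal((B, n))
Delta = rng.standard_normal((B, m))

# Reference only: the full batch gradient, which we want to avoid storing.
G = Delta.T @ X / B                # shape (m, n)

# Streaming rank-1 estimate of G via power iteration; memory is O(m + n).
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
for _ in range(5):                 # a few re-presentations of the batch
    u = np.zeros(m)
    for x, d in zip(X, Delta):     # one example at a time: u <- G v
        u += d * (x @ v) / B
    u /= np.linalg.norm(u)
    v = np.zeros(n)
    for x, d in zip(X, Delta):     # one example at a time: v <- G^T u
        v += x * (d @ u) / B
    sigma = np.linalg.norm(v)      # leading singular value estimate
    v /= sigma

# Rank-1 weight update: a single outer product instead of an m x n matrix.
W_update = -eta * sigma * np.outer(u, v)

# Sanity check against the exact truncated SVD of the full gradient.
U, S, Vt = np.linalg.svd(G)
print("sigma_1 exact vs streaming:", S[0], sigma)
print("alignment |u.u1|, |v.v1|:", abs(u @ U[:, 0]), abs(v @ Vt[0]))
```

In this sketch the batch is re-presented for each power iteration; the memory held between examples is only the pair of vectors u and v, which is the space saving the abstract refers to.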
