Memory-efficient training with streaming dimensionality reduction

The movement of large quantities of data during the training of a deep neural network presents immense challenges for machine learning workloads. To reduce this overhead, particularly the movement and computation of gradient information, we introduce streaming batch principal component analysis (streaming batch PCA) as an update algorithm. Streaming batch PCA uses stochastic power iterations to generate a stochastic rank-k approximation of the network gradient. We demonstrate that the low-rank updates produced by streaming batch PCA can effectively train convolutional neural networks on a variety of common datasets, with performance comparable to standard mini-batch gradient descent. These results can lead to improvements both in the design of application-specific integrated circuits for deep learning and in the speed of synchronization for machine learning models trained with data parallelism.
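To make the update rule concrete, the sketch below shows one way a streaming power iteration can maintain a rank-k approximation of a mini-batch gradient. It is a minimal NumPy reconstruction, not the authors' implementation: the basis-tracking rule, the shapes, and the function name streaming_rank_k_update are assumptions introduced for illustration.

```python
# Minimal sketch of a streaming-batch-PCA-style low-rank gradient update
# (illustrative only; the rank k, shapes, and update rule are assumptions).
import numpy as np

def streaming_rank_k_update(Q, grad_batch):
    """One stochastic power-iteration step.

    Q          : (n, k) orthonormal basis estimate for the top-k gradient subspace.
    grad_batch : (n, m) mini-batch gradient matrix (e.g. one layer's weight gradient).
    Returns the updated basis and the rank-k approximation of grad_batch.
    """
    # Power iteration against the stochastic covariance S = G G^T,
    # computed as G (G^T Q) so the n x n matrix S is never formed.
    Z = grad_batch @ (grad_batch.T @ Q)        # (n, k)
    Q, _ = np.linalg.qr(Z)                     # re-orthonormalize the basis
    low_rank_grad = Q @ (Q.T @ grad_batch)     # project the gradient onto the subspace
    return Q, low_rank_grad

# Toy usage: track a rank-4 subspace of a 256 x 64 "gradient" across 10 batches.
rng = np.random.default_rng(0)
n, m, k = 256, 64, 4
Q = np.linalg.qr(rng.standard_normal((n, k)))[0]
for _ in range(10):
    G = 0.01 * rng.standard_normal((n, m))     # stand-in for a mini-batch gradient
    Q, G_low = streaming_rank_k_update(Q, G)
# G_low would replace G in the weight update; it is fully described by the
# (n, k) basis and (k, m) coefficients, which is what makes the update cheap
# to store or communicate.
```

The point of the streaming formulation is that the full gradient covariance is never materialized: memory stays at O(nk) per layer, which is what makes the approach attractive for hardware accelerators and for reducing synchronization traffic in data-parallel training.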
