Fitting ReLUs via SGD and Quantized SGD

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks, those consisting of a single Rectified Linear Unit (ReLU). These functions are of the form x → max(0, 〈w, x〉), with w ∈ ℝ^d denoting the weight vector. We focus on a planted model where the inputs are drawn i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent (SGD), when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we consider a parallel implementation in which, at each iteration, the mini-batch gradient is computed in a distributed manner across multiple processors and then broadcast to a master or to all other processors. To reduce the communication cost in this setting we use a Quantized Stochastic Gradient Descent (QSGD) scheme in which the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD retains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our theoretical findings with various numerical experiments, including distributed implementations over Amazon EC2.
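Below is a minimal, self-contained sketch (Python/NumPy) of the two ingredients described above: mini-batch SGD for fitting a single planted ReLU on Gaussian inputs, and a QSGD-style unbiased stochastic quantizer applied to each mini-batch gradient before it would be communicated. The dimensions, batch size, step size, number of quantization levels, and the moment-based initialization are illustrative assumptions, not the exact choices analyzed in the paper.

```python
# A minimal sketch (not the authors' code) of the setup described above:
# fit a single ReLU x -> max(0, <w, x>) with mini-batch SGD under a planted
# Gaussian model, quantizing each mini-batch gradient with a QSGD-style
# unbiased stochastic quantizer. All hyperparameters below (d, n, batch
# size, step size, quantization levels) are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d, n = 50, 2000                     # dimension and sample size (assumed)
w_star = rng.normal(size=d)         # planted weight vector
X = rng.normal(size=(n, d))         # i.i.d. Gaussian inputs
y = np.maximum(0.0, X @ w_star)     # labels generated by the planted ReLU


def relu_grad(w, Xb, yb):
    """(Sub)gradient of the mini-batch loss 0.5/b * ||max(0, Xb w) - yb||^2."""
    z = Xb @ w
    residual = np.maximum(0.0, z) - yb
    return Xb.T @ (residual * (z > 0)) / len(yb)


def qsgd_quantize(g, levels=4):
    """QSGD-style unbiased stochastic quantization of g onto `levels`
    magnitude levels per coordinate (the norm and signs are kept exactly)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = np.abs(g) / norm * levels
    lower = np.floor(scaled)
    # Randomized rounding up with probability (scaled - lower) keeps E[q(g)] = g.
    q = lower + (rng.random(g.shape) < scaled - lower)
    return np.sign(g) * q * norm / levels


# Moment-based initialization: for standard Gaussian x, E[2*y*x] = w_star
# (an illustrative choice; the paper's initialization may differ).
w = (2.0 / n) * (X.T @ y)

batch, step, iters = 64, 0.2, 500
for _ in range(iters):
    idx = rng.choice(n, size=batch, replace=False)
    g = relu_grad(w, X[idx], y[idx])
    g = qsgd_quantize(g)            # quantize before "communicating" the gradient
    w -= step * g

print("relative error:", np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```

In a distributed run, the quantizer would be applied to each worker's partial gradient before it is sent to the master (e.g. via MPI), which is where the communication savings arise; since the quantization error scales with the gradient norm, it shrinks as the iterates approach the planted model, which is consistent with the geometric convergence claimed above.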
