Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension
暂无分享,去创建一个
[1] Stefano Soatto,et al. Entropy-SGD: biasing gradient descent into wide valleys , 2016, ICLR.
[2] Rui Peng,et al. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures , 2016, ArXiv.
[3] David Rolnick,et al. Complexity of Linear Regions in Deep Networks , 2019, ICML.
[4] Andrea Montanari,et al. A mean field view of the landscape of two-layer neural networks , 2018, Proceedings of the National Academy of Sciences.
[5] David Rolnick,et al. Deep ReLU Networks Have Surprisingly Few Activation Patterns , 2019, NeurIPS.
[6] Michael I. Jordan,et al. How to Escape Saddle Points Efficiently , 2017, ICML.
[7] Kenji Kawaguchi,et al. Deep Learning without Poor Local Minima , 2016, NIPS.
[8] David Saad,et al. Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks , 1995, NIPS.
[9] Léon Bottou,et al. Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.
[10] Yann LeCun,et al. Optimal Brain Damage , 1989, NIPS.
[11] Liwei Wang,et al. Gradient Descent Finds Global Minima of Deep Neural Networks , 2018, ICML.
[12] Rich Caruana,et al. Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping , 2000, NIPS.
[13] Surya Ganguli,et al. An analytic theory of generalization dynamics and transfer learning in deep linear networks , 2018, ICLR.
[14] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[15] Andrew L. Maas. Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .
[16] Suvrit Sra,et al. Global optimality conditions for deep neural networks , 2017, ICLR.
[17] David Saad,et al. Online Learning in Radial Basis Function Networks , 1997, Neural Computation.
[18] Elad Hazan,et al. An optimal algorithm for stochastic strongly-convex optimization , 2010, 1006.2425.
[19] Andrew Gordon Wilson,et al. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs , 2018, NeurIPS.
[20] Thomas Hofmann,et al. Escaping Saddles with Stochastic Gradients , 2018, ICML.
[21] Jascha Sohl-Dickstein,et al. Measuring the Effects of Data Parallelism on Neural Network Training , 2018, J. Mach. Learn. Res..
[22] Ohad Shamir,et al. Spurious Local Minima are Common in Two-Layer ReLU Neural Networks , 2017, ICML.
[23] Thomas Laurent,et al. The Multilinear Structure of ReLU Networks , 2017, ICML.
[24] Christian Van den Broeck,et al. Statistical Mechanics of Learning , 2001 .
[25] Jascha Sohl-Dickstein,et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability , 2017, NIPS.
[26] Stefano Soatto,et al. Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks , 2017, 2018 Information Theory and Applications Workshop (ITA).
[27] Kurt Hornik,et al. Multilayer feedforward networks are universal approximators , 1989, Neural Networks.
[28] John Shawe-Taylor,et al. Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.
[29] Fred A. Hamprecht,et al. Essentially No Barriers in Neural Network Energy Landscape , 2018, ICML.
[30] Guang Cheng,et al. Optimal Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting , 2020, ArXiv.
[31] Michael I. Jordan,et al. Gradient Descent Can Take Exponential Time to Escape Saddle Points , 2017, NIPS.
[32] Samy Bengio,et al. Understanding deep learning requires rethinking generalization , 2016, ICLR.
[33] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[34] Lei Wu. How SGD Selects the Global Minima in Over-parameterized Learning : A Dynamical Stability Perspective , 2018 .
[35] Yuandong Tian,et al. Luck Matters: Understanding Training Dynamics of Deep ReLU Networks , 2019, ArXiv.
[36] Furong Huang,et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition , 2015, COLT.
[37] Raef Bassily,et al. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning , 2017, ICML.
[38] Nicolas Macris,et al. The committee machine: computational to statistical gaps in learning a two-layers neural network , 2018, NeurIPS.
[39] Surya Ganguli,et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks , 2013, ICLR.
[40] Yuanzhi Li,et al. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers , 2018, NeurIPS.
[41] Yuanzhi Li,et al. A Convergence Theory for Deep Learning via Over-Parameterization , 2018, ICML.
[42] Hod Lipson,et al. Convergent Learning: Do different neural networks learn the same representations? , 2015, FE@NIPS.
[43] Suvrit Sra,et al. Small nonlinearities in activation functions create bad local minima in neural networks , 2018, ICLR.
[44] Francis Bach,et al. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport , 2018, NeurIPS.
[45] Yuanzhi Li,et al. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data , 2018, NeurIPS.
[46] Yann Ollivier,et al. Natural Langevin Dynamics for Neural Networks , 2017, GSI.
[47] Wei Hu,et al. A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks , 2018, ICLR.
[48] Yuandong Tian,et al. Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima , 2017, ICML.
[49] Raef Bassily,et al. On exponential convergence of SGD in non-convex over-parametrized learning , 2018, ArXiv.
[50] Anthony C. C. Coolen,et al. Statistical mechanical analysis of the dynamics of learning in perceptrons , 1997, Stat. Comput..
[51] Thomas Laurent,et al. Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global , 2017, ICML.
[52] Sanjeev Arora,et al. Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets , 2019, NeurIPS.
[53] E. Gardner,et al. Three unfinished works on the optimal storage capacity of networks , 1989 .
[54] Arthur Jacot,et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper) , 2018, NeurIPS.
[55] Gregory J. Wolff,et al. Optimal Brain Surgeon and general network pruning , 1993, IEEE International Conference on Neural Networks.
[56] Yee Whye Teh,et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics , 2011, ICML.
[57] Mikhail Belkin,et al. MaSS: an Accelerated Stochastic Method for Over-parametrized Learning , 2018, ArXiv.
[58] Matthias Hein,et al. The Loss Surface of Deep and Wide Neural Networks , 2017, ICML.
[59] Andrew M. Saxe,et al. High-dimensional dynamics of generalization error in neural networks , 2017, Neural Networks.
[60] David M. Blei,et al. Stochastic Gradient Descent as Approximate Bayesian Inference , 2017, J. Mach. Learn. Res..
[61] David Rolnick,et al. Identifying Weights and Architectures of Unknown ReLU Networks , 2019, ArXiv.
[62] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Florent Krzakala,et al. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup , 2019, NeurIPS.
[64] Saad,et al. On-line learning in soft committee machines. , 1995, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics.
[65] John N. Tsitsiklis,et al. Gradient Convergence in Gradient methods with Errors , 1999, SIAM J. Optim..
[66] Tengyu Ma,et al. Learning One-hidden-layer Neural Networks with Landscape Design , 2017, ICLR.