Aurélien Lucchi | Antonio Orvieto | Thomas Hofmann | Dario Pavllo | Jonas Köhler