Diego Granziol | Xingchen Wan | Stephen J. Roberts