[1] Anatoli B. Juditsky, et al. Unifying mirror descent and dual averaging, 2019, ArXiv.
[2] Stéphan Clémençon, et al. Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions, 2016, ICML.
[3] Anders Krogh, et al. A Simple Weight Decay Can Improve Generalization, 1991, NIPS.
[4] Zeyuan Allen-Zhu, et al. How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD, 2018, NeurIPS.
[5] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[6] Colin Wei, et al. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel, 2018, NeurIPS.
[7] Yurii Nesterov, et al. Primal-dual subgradient methods for convex problems, 2005, Math. Program.
[8] Lin Xiao, et al. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization, 2009, J. Mach. Learn. Res.
[9] Stephen J. Wright, et al. Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning, 2012, J. Mach. Learn. Res.
[10] Pascal Vincent, et al. fastMRI: An Open Dataset and Benchmarks for Accelerated MRI, 2018, ArXiv.
[11] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.
[12] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[13] Yu. Nesterov, et al. Quasi-monotone Subgradient Methods for Nonsmooth Convex Minimization, 2015, J. Optim. Theory Appl.
[14] L. Bottou. Stochastic Gradient Learning in Neural Networks, 1991.
[15] Jonathan Eckstein, et al. Nonlinear Proximal Point Algorithms Using Bregman Functions, with Applications to Convex Programming, 1993, Math. Oper. Res.
[16] Kurt Keutzer, et al. ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning, 2020, ArXiv.
[17] Francis Bach, et al. On the Convergence of Adam and Adagrad, 2020, ArXiv.
[18] Li Shen, et al. Weighted AdaGrad with Unified Momentum, 2018.
[19] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[20] H. Robbins. A Stochastic Approximation Method, 1951.
[21] Marcello Federico, et al. Report on the 11th IWSLT evaluation campaign, 2014, IWSLT.
[22] John C. Duchi, et al. Asymptotic optimality in stochastic optimization, 2016, The Annals of Statistics.
[23] Aaron Defazio. Offset Sampling Improves Deep Learning based Accelerated MRI Reconstructions by Exploiting Symmetry, 2019.
[24] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[25] Mehran Mesbahi, et al. Online distributed optimization via dual averaging, 2013, IEEE Conference on Decision and Control (CDC).
[26] Xiaoxia Wu, et al. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization, 2018, ICML.
[27] Heinz H. Bauschke, et al. Legendre functions and the method of random Bregman projections, 1997.
[28] Léon Bottou, et al. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning, 2018, NeurIPS.
[29] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.
[30] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.
[31] Zeyuan Allen-Zhu, et al. Optimal Black-Box Reductions Between Optimization Objectives, 2016, NIPS.
[32] Aaron Defazio, et al. End-to-End Variational Networks for Accelerated MRI Reconstruction, 2020, MICCAI.
[33] Jingbo Zhu, et al. Learning Deep Transformer Models for Machine Translation, 2019, ACL.
[34] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, 1967.
[35] Andrew Zisserman, et al. Deep Frank-Wolfe For Neural Network Optimization, 2018, ICLR.
[36] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[37] Aaron Defazio. Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis, 2020, ArXiv.
[38] Dmitriy Drusvyatskiy, et al. Stochastic model-based minimization of weakly convex functions, 2018, SIAM J. Optim.
[39] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[40] Martin J. Wainwright, et al. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling, 2010, IEEE Transactions on Automatic Control.
[41] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[42] Michael G. Rabbat, et al. Push-Sum Distributed Dual Averaging for convex optimization, 2012, IEEE Conference on Decision and Control (CDC).
[43] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[44] Niao He, et al. On the Convergence Rate of Stochastic Mirror Descent for Nonsmooth Nonconvex Optimization, 2018, arXiv:1806.04781.
[45] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[46] Marc Teboulle, et al. Entropic Proximal Mappings with Applications to Nonlinear Programming, 1992, Math. Oper. Res.
[47] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.
[48] S. Bos, et al. Using weight decay to optimize the generalization ability of a perceptron, 1996, ICNN'96.
[49] Zhisong Pan, et al. Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence, 2020, IEEE Transactions on Cybernetics.
[50] Francesco Orabona, et al. On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, 2018, AISTATS.
[51] Jaehoon Lee, et al. On Empirical Comparisons of Optimizers for Deep Learning, 2019, ArXiv.
[52] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.
[53] Aaron Defazio, et al. On the convergence of the Stochastic Heavy Ball Method, 2020, ArXiv.
[54] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[55] Shahin Shahrampour, et al. Exponentially fast parameter estimation in networks using distributed dual averaging, 2013, IEEE Conference on Decision and Control (CDC).