Eva: A General Vectorized Approximation Framework for Second-order Optimization