New Gradient-Weighted Adaptive Gradient Methods With Dynamic Constraints

Existing adaptive gradient descent optimization algorithms, such as adaptive gradient (Adagrad), adaptive moment estimation (Adam), and root mean square propagation (RMSprop), increase convergence speed by dynamically adjusting the learning rate. However, in some application scenarios the generalization ability of these adaptive algorithms is inferior to that of stochastic gradient descent (SGD). To address this problem, several improved algorithms have recently been proposed, including adaptive mean square gradient (AMSGrad) and AdaBound. In this paper, we present new variants of AdaBound and AMSBound, called GWDC (Adam with weighted gradient and dynamic bound of learning rate) and AMSGWDC (AMSGrad with weighted gradient and dynamic bound of learning rate), respectively. The proposed algorithms build on a dynamic decay rate method that places greater weight on recent gradients in the first moment estimate. A theoretical proof of the convergence of the proposed algorithms is also presented. To evaluate GWDC and AMSGWDC, we compare them with other popular optimization algorithms on three well-known machine learning models: a feedforward neural network, a convolutional neural network, and a gated recurrent unit network. Experimental results show that the proposed algorithms generalize better than the other optimization algorithms on test data and also converge faster.
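The abstract describes two ingredients: a dynamic decay rate for the first moment estimate that emphasizes recent gradients, and AdaBound-style dynamic bounds that clip the per-parameter learning rate. The NumPy sketch below illustrates how an Adam-style update combining these two ideas could look. The function name gwdc_like_step, the decay schedule beta1_base / (1 + gamma * t), and the bound schedules are illustrative assumptions (the bounds follow the published AdaBound form); they are not the paper's exact GWDC/AMSGWDC formulas.

```python
import numpy as np

def gwdc_like_step(theta, grad, state, t,
                   alpha=1e-2, beta1_base=0.9, beta2=0.999,
                   final_lr=0.1, gamma=1e-3, eps=1e-8):
    """One illustrative Adam-style step with (a) a time-varying first-moment
    decay rate that shifts weight toward recent gradients and (b) dynamic
    bounds that clip the element-wise step size (AdaBound-style).
    The specific schedules are assumptions, not the paper's formulas."""
    m, v, beta1_prod = state["m"], state["v"], state["beta1_prod"]

    # (a) Dynamic decay rate: beta1 decreases with t, so the first moment
    #     estimate puts more weight on the most recent gradients.
    beta1_t = beta1_base / (1.0 + gamma * t)

    m = beta1_t * m + (1.0 - beta1_t) * grad      # weighted first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment, as in Adam

    # Bias correction; with a varying beta1 the running product replaces beta1**t.
    beta1_prod *= beta1_t
    m_hat = m / (1.0 - beta1_prod)
    v_hat = v / (1.0 - beta2 ** t)

    # (b) Dynamic bound of the learning rate: the per-parameter step size is
    #     clipped into [lower_t, upper_t]; both bounds approach final_lr as t
    #     grows, so the update gradually behaves like SGD with rate final_lr.
    lower_t = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper_t = final_lr * (1.0 + 1.0 / (gamma * t))
    step_size = np.clip(alpha / (np.sqrt(v_hat) + eps), lower_t, upper_t)

    state.update(m=m, v=v, beta1_prod=beta1_prod)
    return theta - step_size * m_hat, state

# Minimal usage: minimize f(x) = ||x||^2 from a random start.
theta = np.random.randn(5)
state = {"m": np.zeros(5), "v": np.zeros(5), "beta1_prod": 1.0}
for t in range(1, 1001):
    grad = 2.0 * theta                  # gradient of ||x||^2
    theta, state = gwdc_like_step(theta, grad, state, t)
print(theta)                            # entries driven toward zero
```

In this sketch the early iterations behave like an adaptive method (the bounds are loose), while later iterations are increasingly constrained toward an SGD-like step of size final_lr, which is the intuition behind combining gradient weighting with dynamic learning-rate bounds.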
