Guided Stochastic Gradient Descent Algorithm for Inconsistent Datasets

Abstract: The Stochastic Gradient Descent (SGD) algorithm, despite its simplicity, is regarded as an effective, de facto standard optimizer for machine learning classification models such as neural networks and logistic regression. However, each SGD update is biased towards the randomly selected data instance; in this paper, this effect is termed data inconsistency. The proposed variation of SGD, the Guided Stochastic Gradient Descent (GSGD) algorithm, seeks to overcome this inconsistency in a given dataset through greedy selection of consistent data instances for gradient descent. Empirical results show the efficacy of the method. Moreover, GSGD has also been incorporated into and tested with other popular variations of SGD, such as Adam, Adagrad and Momentum. Within a limited time budget, the guided search of GSGD achieves better convergence and classification accuracy than canonical SGD and its other variants. It also maintains this efficiency when evaluated on medical benchmark datasets using logistic regression for classification.
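The abstract describes GSGD only at a high level, so the following is a minimal sketch of the greedy, consistency-guided selection idea applied to logistic regression. It assumes a particular consistency score (how far an instance's loss lies from a running-average loss) and illustrative parameters (`rho` candidates per step, an exponential moving average); these details are assumptions for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = (X @ true_w + 0.5 * rng.normal(size=500) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def instance_loss(w, x_i, y_i):
    # Logistic (cross-entropy) loss of a single instance.
    p = sigmoid(x_i @ w)
    eps = 1e-12
    return -(y_i * np.log(p + eps) + (1 - y_i) * np.log(1 - p + eps))

def instance_grad(w, x_i, y_i):
    # Gradient of the logistic loss for a single instance.
    return (sigmoid(x_i @ w) - y_i) * x_i

def guided_sgd(X, y, lr=0.1, epochs=20, rho=8):
    """Greedy, consistency-guided instance selection (illustrative sketch).

    At each step, `rho` candidate instances are sampled; the one whose loss
    deviates least from a running-average loss is treated as the most
    'consistent' instance and used for the gradient update.
    """
    n, d = X.shape
    w = np.zeros(d)
    avg_loss = None
    for _ in range(epochs):
        for _ in range(n):
            idx = rng.choice(n, size=rho, replace=False)
            losses = np.array([instance_loss(w, X[i], y[i]) for i in idx])
            if avg_loss is None:
                avg_loss = losses.mean()
            # Greedy pick: candidate whose loss is closest to the running average.
            best = idx[np.argmin(np.abs(losses - avg_loss))]
            w -= lr * instance_grad(w, X[best], y[best])
            # Update the running-average loss with the chosen instance.
            avg_loss = 0.9 * avg_loss + 0.1 * instance_loss(w, X[best], y[best])
    return w

w = guided_sgd(X, y)
acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"training accuracy: {acc:.3f}")
```

The consistency score used here is a placeholder; the paper's own criterion for identifying consistent instances, and how it is combined with Adam, Adagrad or Momentum updates, may differ.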
