Several second-order optimization methods for gradient descent have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases this leads to an O(n²) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead to study the Hessian before training. Based on a second-order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that block-diagonality can be achieved in common machine learning models simply by selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs that satisfies the conditions highlighted by this second-order analysis and also yields very good generalization performance in practice, by taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets illustrate this approach.
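To make the cost argument concrete, here is a minimal sketch in illustrative notation (the symbols θ, C, H, and the block sizes n_1, …, n_k are our own, not taken from the paper). A Newton-type update reads

\[
\theta \leftarrow \theta - H^{-1} \nabla C(\theta), \qquad H = \nabla^2 C(\theta) \in \mathbb{R}^{n \times n},
\]

so merely storing H (or an approximation of H^{-1}) already costs O(n²) in space. If instead H is block-diagonal with blocks H_1, …, H_k of sizes n_1, …, n_k, then

\[
H^{-1} = \operatorname{diag}\!\left(H_1^{-1}, \dots, H_k^{-1}\right),
\]

and the update decouples into k independent sub-problems with total storage O(∑_i n_i²), far below O(n²) whenever each n_i is small relative to n.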