Novel Maximum-Margin Training Algorithms for Supervised Neural Networks

This paper proposes three novel training methods for multilayer perceptron (MLP) binary classifiers: two based on the backpropagation approach and a third based on information theory. Both backpropagation methods follow the maximal-margin (MM) principle. The first, built on the gradient descent with adaptive learning rate algorithm (GDX) and named maximum-margin GDX (MMGDX), directly increases the margin of the MLP output-layer hyperplane. The method jointly optimizes both MLP layers in a single process, backpropagating the gradient of an MM-based objective function through the output and hidden layers in order to create a hidden-layer space that enables a larger margin for the output-layer hyperplane, thereby avoiding the trial of many arbitrary kernels, as occurs in support vector machine (SVM) training. The proposed MM-based objective function aims to stretch the margin to its limit. An objective function based on the Lp-norm is also proposed in order to incorporate the idea of support vectors while avoiding the constrained optimization problem usually solved in SVM training. In fact, all the training methods proposed in this paper have time and space complexity O(N), whereas usual SVM training methods have time complexity O(N^3) and space complexity O(N^2), where N is the size of the training data set. The second approach, named minimization of interclass interference (MICI), has an objective function inspired by Fisher discriminant analysis. This algorithm aims to shape the MLP hidden-layer output so that the patterns have a desirable statistical distribution. In both training methods, the maximum area under the ROC curve (AUC) is applied as the stopping criterion. The third approach offers a robust training framework able to take the best of each proposed training method. The main idea is to compose a neural model from neurons extracted from three other neural networks, previously trained by MICI, MMGDX, and Levenberg-Marquardt (LM), respectively. The resulting network is named the assembled neural network (ASNN). Benchmark data sets of real-world problems have been used in experiments that enable comparison with other state-of-the-art classifiers. The results provide evidence of the effectiveness of our methods in terms of accuracy, AUC, and balanced error rate.
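To make the maximum-margin idea concrete, the sketch below trains a one-hidden-layer MLP binary classifier by plain gradient descent on a hinge-style margin loss, backpropagating the loss gradient through both layers jointly. This is a minimal illustration under our own assumptions, not the paper's MMGDX algorithm: the exact MM objective, the GDX adaptive-learning-rate update, and the AUC stopping criterion are not reproduced, and every name in the code is hypothetical.

```python
import numpy as np

# Minimal margin-based MLP training sketch (hypothetical; not the
# paper's exact MMGDX objective or its GDX adaptive-rate update).
rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# One tanh hidden layer, one linear output unit.
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros(8)
w2 = rng.normal(scale=0.5, size=8)
b2 = 0.0
lr = 0.05  # fixed rate; GDX would adapt this during training

for epoch in range(500):
    H = np.tanh(X @ W1 + b1)      # hidden-layer space
    f = H @ w2 + b2               # output-layer hyperplane response
    margins = y * f
    # Hinge-style loss max(0, 1 - y*f): only margin violators
    # contribute a gradient, loosely echoing support vectors.
    active = margins < 1.0
    g = -y * active               # dL/df per pattern
    n = len(X)
    grad_w2 = H.T @ g / n
    grad_b2 = g.mean()
    dH = np.outer(g, w2) * (1.0 - H**2)  # backprop through tanh
    grad_W1 = X.T @ dH / n
    grad_b1 = dH.mean(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2

acc = np.mean(np.sign(np.tanh(X @ W1 + b1) @ w2 + b2) == y)
print("training accuracy:", acc)
```

Each pass over the data costs O(N) time and memory, which is the property the abstract contrasts with the O(N^3) time and O(N^2) space of standard SVM solvers. For orientation on MICI, the classical two-class Fisher criterion it is inspired by, for projected class means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, is

$$J = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},$$

which grows as the class means separate and the within-class scatter shrinks; the paper's actual MICI objective is only summarized in the abstract and may differ in detail.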
