Unsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer Networks

Controlling network complexity in order to prevent overfitting is one of the major problems encountered when neural network models are used to extract structure from small data sets. In this paper we present a network architecture designed for use with a cost function that includes a novel complexity penalty term. In this architecture the hidden-unit outputs are strictly positive and sum to one, and each is interpreted as the probability that the current input belongs to a class formed during learning. The penalty term expresses the mutual information between the inputs and the extracted classes, a measure that describes the network's complexity with respect to the given data in an unsupervised fashion. The efficiency of this architecture and penalty term, combined with backpropagation training, is demonstrated on a real-world economic time series forecasting problem. The model was also applied to the benchmark sunspot data and to a synthetic data set from the statistics community.
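
The abstract describes hidden units whose outputs are positive and sum to one (softmax-style class probabilities) and a penalty equal to the mutual information between the inputs and the extracted classes. A common batch estimate of that quantity is I = H(mean class distribution) - mean per-pattern entropy. The sketch below is a minimal illustration of such a penalty under these assumptions; the names (softmax, mutual_information_penalty, lam) are hypothetical and this is not the authors' exact formulation.

```python
import numpy as np

def softmax(a):
    """Map raw hidden activations to strictly positive values that sum to one per pattern."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy of the probability vector(s) in p (last axis)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def mutual_information_penalty(class_probs):
    """Estimate I(input; class) = H(marginal class distribution) - mean per-pattern entropy."""
    marginal = class_probs.mean(axis=0)   # class distribution averaged over the data set
    return entropy(marginal) - entropy(class_probs).mean()

# Hypothetical use in a training loop (assumed names):
# hidden = softmax(X @ W + b)             # class-probability outputs of the hidden layer
# cost = mse_loss + lam * mutual_information_penalty(hidden)
```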
