Nonmonotonic Generalization Bias of Gaussian Mixture Models

Theories of learning and generalization hold that the generalization bias, defined as the difference between the generalization error and the training error, increases on average with the number of adaptive parameters. This article, however, shows that a Gaussian mixture model violates this general tendency: for temperatures just below the first symmetry-breaking point, the effective number of adaptive parameters increases while the generalization bias decreases. We compute the dependence of the network information criterion on temperature around the symmetry breaking. Our results are confirmed by numerical cross-validation experiments.
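The abstract leans on two standard quantities. As a hedged sketch (the symbols b(n), G, J, and the per-sample loss ℓ are our notation, not taken from the paper), the generalization bias and the network information criterion of Murata, Yoshizawa, and Amari are commonly written as:

```latex
% Sketch of the standard definitions (notation ours).
% \hat{\theta}_n: parameters estimated from n training samples;
% \ell: per-sample loss; expectations are over the true data distribution.
\[
  b(n) \;=\; \mathbb{E}\!\left[\, E_{\mathrm{gen}}(\hat{\theta}_n)
                               - E_{\mathrm{train}}(\hat{\theta}_n) \,\right]
\]
\[
  \mathrm{NIC} \;=\; E_{\mathrm{train}}(\hat{\theta}_n)
                 \;+\; \frac{1}{n}\,\operatorname{tr}\!\left(G J^{-1}\right),
  \qquad
  G = \mathbb{E}\!\left[\nabla\ell\,\nabla\ell^{\top}\right],\quad
  J = \mathbb{E}\!\left[\nabla^{2}\ell\right]
\]
```

Here tr(G J^{-1}) plays the role of the effective number of parameters; for a well-specified model G = J and the trace reduces to the raw parameter count, recovering AIC. The abstract's observation is that just below the first symmetry-breaking temperature this effective count and the bias b(n) move in opposite directions.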
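To make the symmetry breaking concrete, here is a minimal numerical sketch, not the paper's experimental setup: a tempered EM loop for a two-component, unit-variance, one-dimensional Gaussian mixture, in which the E-step log-responsibilities are scaled by an inverse temperature beta = 1/T. The function name tempered_em, the synthetic data, and all constants are assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: an equal mixture of N(-1, 1) and N(+1, 1).
x = np.concatenate([rng.normal(-1.0, 1.0, 1000),
                    rng.normal(+1.0, 1.0, 1000)])

def tempered_em(x, T, n_iter=500):
    """EM for a two-component, unit-variance 1-D Gaussian mixture whose
    E-step log-responsibilities are scaled by beta = 1/T (annealing)."""
    beta = 1.0 / T
    mu = np.array([-0.01, 0.01])  # tiny asymmetric start to seed the split
    for _ in range(n_iter):
        # E-step: tempered responsibilities (equal, fixed mixing weights).
        log_r = -0.5 * beta * (x[:, None] - mu[None, :]) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means.
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mu

for T in (4.0, 3.0, 2.0, 1.5, 1.0, 0.5):
    mu = tempered_em(x, T)
    print(f"T = {T:4.1f}   mu = ({mu[0]:+.3f}, {mu[1]:+.3f})   "
          f"split = {abs(mu[1] - mu[0]):.3f}")
```

In this toy setting the total data variance is about 2, and a linear-stability argument puts the first critical temperature near T ≈ var(x): above it both means converge to the sample mean (the symmetric solution), while below it they split apart, which is the kind of symmetry breaking the abstract studies.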
