Simplifying Neural Networks by Soft Weight-Sharing

One way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. Simple versions of this approach include penalizing the sum of the squares of the weights or penalizing the number of nonzero weights. We propose a more complicated penalty term in which the distribution of weight values is modeled as a mixture of multiple Gaussians. A set of weights is simple if the weights have high probability density under the mixture model. This can be achieved by clustering the weights into subsets with the weights in each cluster having very similar values. Since we do not know the appropriate means or variances of the clusters in advance, we allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations on two different problems demonstrate that this complexity term is more effective than previous complexity terms.
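The penalty described above can be written down directly: the complexity cost is the negative log probability of the weight vector under a Gaussian mixture, C(w) = -Σ_i log Σ_j π_j N(w_i; μ_j, σ_j²), and the mixture parameters π_j, μ_j, σ_j are updated by gradient descent together with the weights. Below is a minimal PyTorch sketch of that penalty, not the authors' implementation; the function and variable names (soft_weight_sharing_penalty, log_pi, and so on) are illustrative assumptions, and details such as fixing one component's mean at zero or the relative weighting against the data-misfit term are left out.

```python
import math
import torch

def soft_weight_sharing_penalty(weights, log_pi, mu, log_sigma):
    """Negative log-likelihood of the weights under a Gaussian mixture.

    weights   : 1-D tensor holding every weight in the network, shape (N,)
    log_pi    : unnormalized log mixing proportions, shape (K,)
    mu        : component means, shape (K,)
    log_sigma : log standard deviations, shape (K,)
    All four tensors may require gradients, so the mixture parameters
    can adapt at the same time as the network learns.
    """
    pi = torch.softmax(log_pi, dim=0)          # mixing proportions, sum to 1
    sigma = torch.exp(log_sigma)               # keeps standard deviations positive
    w = weights.unsqueeze(1)                   # (N, 1), broadcast against (K,)
    # log N(w_i | mu_j, sigma_j^2) for every weight/component pair: (N, K)
    log_norm = (-0.5 * ((w - mu) / sigma) ** 2
                - torch.log(sigma)
                - 0.5 * math.log(2.0 * math.pi))
    # log of the mixture density for each weight, computed stably
    log_mix = torch.logsumexp(torch.log(pi) + log_norm, dim=1)
    return -log_mix.sum()                      # high density => low penalty

# Illustrative usage: a 3-component mixture penalizing a toy weight vector.
weights = torch.randn(50, requires_grad=True)
log_pi = torch.zeros(3, requires_grad=True)
mu = torch.tensor([-0.5, 0.0, 0.5], requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)
penalty = soft_weight_sharing_penalty(weights, log_pi, mu, log_sigma)
penalty.backward()   # gradients flow to weights and mixture parameters alike
```

In training one would minimize data_error + λ · penalty, letting the optimizer update log_pi, mu, and log_sigma along with the network weights, so the mixture components drift toward clusters of similar weight values while the weights are pulled toward the nearest component mean.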
