Neural Networks: A Review from a Statistical Perspective

This paper informs a statistical readership about Artificial Neural Networks (ANNs), points out some of the links with statistical methodology and encourages cross-disciplinary research in the directions most likely to bear fruit. The areas of statistical interest are briefly outlined, and a series of examples indicates the flavor of ANN models. We then treat various topics in more depth. In each case, we describe the neural network architectures and training rules and provide a statistical commentary. The topics treated in this way are perceptrons (from single-unit to multilayer versions), Hopfield-type recurrent networks (including probabilistic versions strongly related to statistical physics and Gibbs distributions) and associative memory networks trained by so-called unsupervised learning rules. Perceptrons are shown to have strong associations with discriminant analysis and regression, and unsupervised networks with cluster analysis. The paper concludes with some thoughts on the future of the interface between neural networks and statistics.
