Local models and Gaussian mixture models for statistical data processing

In this dissertation, we present local linear models for dimension reduction and Gaussian mixture models for classification and regression. When the data has different structure in different parts of the input space, fitting one global model can be slow and inaccurate. Simple models can quickly learn the structure of the data in small (local) regions, so local learning techniques can offer faster and more accurate model fitting. Gaussian mixture models form a soft local model of the data: each data point belongs to all of the "local" regions (Gaussians) at once, with differing degrees of membership, so the mixture blends the local models together. We show that local linear dimension reduction approximates maximum likelihood signal extraction for a mixture-of-Gaussians signal-plus-noise model. The thesis of this document is that local learning models can perform efficient (fast and accurate) data processing.

We propose local linear dimension reduction algorithms that partition the input space and build separate low-dimensional coordinate systems in the resulting disjoint regions. We compare the local linear models with a global linear model (principal components analysis) and a global non-linear model (five-layer auto-associative neural networks). For speech and image data, the local linear models incur about half the error of the global models while training nearly an order of magnitude faster than the neural networks. Under certain conditions, the local linear models are related to a mixture-of-Gaussians data model. Motivated by this relation, we present Gaussian mixture models for classification and regression and propose algorithms for regularizing them. Our results on speech phoneme classification and several benchmark regression tasks indicate that the mixture models perform comparably to a global model (neural networks).
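The local linear dimension reduction idea above can be sketched in a few lines: partition the input space (here with plain k-means, one plausible choice) and fit a separate principal-component basis inside each region. This is a minimal illustrative sketch, not the dissertation's actual algorithm; the function names, the k-means partitioner, and all parameters are hypothetical.

```python
import numpy as np

def local_pca(X, n_regions=4, n_components=2, n_iter=50, seed=0):
    """Partition the input space with k-means, then fit a separate
    PCA (a local low-dimensional coordinate system) in each region.
    Illustrative sketch only; details of the partitioning differ
    from the dissertation's algorithms."""
    rng = np.random.default_rng(seed)
    # k-means partition of the input space
    centers = X[rng.choice(len(X), n_regions, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_regions):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    # final region assignment for the converged centers
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    # separate low-dimensional coordinate system per region
    bases = []
    for k in range(n_regions):
        pts = X[labels == k] - centers[k]
        # principal directions from the SVD of the centered local data
        _, _, vt = np.linalg.svd(pts, full_matrices=False)
        bases.append(vt[:n_components])
    return centers, bases, labels

def encode_decode(x, centers, bases):
    """Project a point onto its region's local coordinates and back."""
    k = ((centers - x) ** 2).sum(1).argmin()
    z = bases[k] @ (x - centers[k])          # low-dimensional code
    return centers[k] + bases[k].T @ z       # reconstruction
```

Because each region's SVD sees only nearby points, each local basis is cheap to compute, which is the source of the speed advantage over training a global non-linear model.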
To summarize, local models and Gaussian mixture models can be efficient tools for dimension reduction, exploratory data analysis, feature extraction, classification, and regression.
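To make the mixture-model side concrete, here is a minimal EM fit for a diagonal-covariance Gaussian mixture, used as a class-conditional density for Bayes-rule classification. This is a generic sketch of the standard technique, not the dissertation's regularized algorithms; all names and parameters are hypothetical.

```python
import numpy as np

def fit_gmm(X, n_comp=2, n_iter=100, seed=0):
    """EM for a diagonal-covariance Gaussian mixture: a soft local model
    in which every point belongs to every component with some
    responsibility. Minimal sketch; no regularization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, n_comp, replace=False)].astype(float)
    var = np.tile(X.var(0) + 1e-6, (n_comp, 1))
    pi = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities (degrees of membership in each Gaussian)
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(pi))
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M-step: responsibility-weighted means, variances, mixing weights
        nk = r.sum(0) + 1e-12
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X**2) / nk[:, None] - mu**2 + 1e-6
        pi = nk / n
    return pi, mu, var

def gmm_loglik(x, pi, mu, var):
    """Log-density of one point under the fitted mixture."""
    logp = (-0.5 * (((x - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1)
            + np.log(pi))
    m = logp.max()
    return m + np.log(np.exp(logp - m).sum())

def classify(x, class_models, priors):
    """Bayes rule over per-class mixture densities."""
    scores = [np.log(p) + gmm_loglik(x, *m)
              for m, p in zip(class_models, priors)]
    return int(np.argmax(scores))
```

A typical use fits one mixture per class (e.g. per phoneme) and assigns a new point to the class whose mixture density, weighted by the class prior, is largest.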

[1]  S. Newcomb A Generalized Theory of the Combination of Observations so as to Obtain the Best Result , 1886 .

[2]  K. Pearson Contributions to the Mathematical Theory of Evolution , 1894 .

[3]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[4]  G. E. Peterson,et al.  Control Methods Used in a Study of the Vowels , 1951 .

[5]  T. W. Anderson ASYMPTOTIC THEORY FOR PRINCIPAL COMPONENT ANALYSIS , 1963 .

[6]  J. Kruskal Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis , 1964 .

[7]  V. Hasselblad Estimation of parameters for a mixture of normal distributions , 1966 .

[8]  R. Plomp,et al.  Dimensional analysis of vowel spectra , 1967 .

[9]  L. Baum,et al.  An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology , 1967 .

[10]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[11]  E. B. Andersen,et al.  Modern factor analysis , 1961 .

[12]  R. Plomp,et al.  Perceptual and physical space of vowel sounds. , 1969, The Journal of the Acoustical Society of America.

[13]  J. Wolfe PATTERN CLUSTERING BY MULTIVARIATE MIXTURE ANALYSIS. , 1970, Multivariate behavioral research.

[14]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[15]  Keinosuke Fukunaga,et al.  An Algorithm for Finding Intrinsic Dimensionality of Data , 1971, IEEE Transactions on Computers.

[16]  R. Plomp,et al.  Frequency analysis of Dutch vowels from 50 male speakers. , 1973, The Journal of the Acoustical Society of America.

[17]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[18]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[19]  J. Gower,et al.  Methods for statistical data analysis of multivariate observations , 1977, A Wiley publication in applied statistics.

[20]  Allen Gersho,et al.  Asymptotically optimal block quantization , 1979, IEEE Trans. Inf. Theory.

[21]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[22]  M. Aitkin,et al.  Mixture Models, Outliers, and the EM Algorithm , 1980 .

[23]  Murray Aitkin,et al.  Statistical Modelling of Data on Teaching Styles , 1981 .

[24]  Biing-Hwang Juang,et al.  Multiple stage vector quantization for speech coding , 1982, ICASSP.

[25]  Erkki Oja,et al.  Subspace methods of pattern recognition , 1983 .

[26]  Thomas Kailath,et al.  Detection of signals by information theoretic criteria , 1985, IEEE Trans. Acoust. Speech Signal Process..

[27]  R. Cranley,et al.  Multivariate Analysis—Methods and Applications , 1985 .

[28]  G. McLachlan,et al.  Estimation of Allocation Rates in a Cluster Analysis Context , 1985 .

[29]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[30]  J. Schmee An Introduction to Multivariate Statistical Analysis , 1986 .

[31]  A. F. Smith,et al.  Statistical analysis of finite mixture distributions , 1986 .

[32]  Farmer,et al.  Predicting chaotic time series. , 1987, Physical review letters.

[33]  A. Lapedes,et al.  Nonlinear Signal Processing Using Neural Networks , 1987 .

[34]  A. Lapedes,et al.  Nonlinear signal processing using neural networks: Prediction and system modelling , 1987 .

[35]  A. F. Smith,et al.  Statistical analysis of finite mixture distributions , 1986 .

[36]  Teuvo Kohonen,et al.  Self-Organization and Associative Memory , 1988 .

[37]  Terence D. Sanger,et al.  An Optimality Principle for Unsupervised Learning , 1988, NIPS.

[38]  Paul W. Munro,et al.  Principal Components Analysis Of Images Via Back Propagation , 1988, Other Conferences.

[39]  P. Foldiak,et al.  Adaptive network for optimal linear feature extraction , 1989, International 1989 Joint Conference on Neural Networks.

[40]  Erkki Oja,et al.  Neural Networks, Principal Components, and Subspaces , 1989, Int. J. Neural Syst..

[41]  J. E. Glynn,et al.  Numerical Recipes: The Art of Scientific Computing , 1989 .

[42]  John Moody,et al.  Fast Learning in Networks of Locally-Tuned Processing Units , 1989, Neural Computation.

[43]  D. Broad,et al.  Formant estimation by linear trans-formation of the lpc cepstrum , 1989 .

[44]  Ken-ichi Funahashi,et al.  On the approximate realization of continuous mappings by neural networks , 1989, Neural Networks.

[45]  J. Rubner,et al.  A Self-Organizing Network for Principal-Component Analysis , 1989 .

[46]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[47]  Kurt Hornik,et al.  Multilayer feedforward networks are universal approximators , 1989, Neural Networks.

[48]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[49]  Sun-Yuan Kung,et al.  A neural network learning algorithm for adaptive principal component extraction (APEX) , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[50]  Farrell,et al.  Characterizing attractors using local intrinsic dimensions calculated by singular-value decomposition and information-theoretic criteria. , 1990, Physical review. A, Atomic, molecular, and optical physics.

[51]  Garrison W. Cottrell,et al.  EMPATH: Face, Emotion, and Gender Recognition Using Holons , 1990, NIPS.

[52]  Todd K. Leen,et al.  Hebbian feature discovery improves classifier efficiency , 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[53]  John E. Moody,et al.  Note on Learning Rate Schedules for Stochastic Optimization , 1990, NIPS.

[54]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[55]  Terrence J. Sejnowski,et al.  SEXNET: A Neural Network Identifies Sex From Human Faces , 1990, NIPS.

[56]  M. Arozullah,et al.  Higher order data compression with neural networks , 1991, IJCNN-91-Seattle International Joint Conference on Neural Networks.

[57]  James D. Keeler,et al.  Predicting the Future: Advantages of Semilocal Units , 1991, Neural Computation.

[58]  Guy A. Dumont,et al.  Classification of acoustic emission signals via Hebbian feature extraction , 1991, IJCNN-91-Seattle International Joint Conference on Neural Networks.

[59]  Norman Yarvin,et al.  Networks with Learned Unit Response Functions , 1991, NIPS.

[60]  Steven J. Nowlan,et al.  Soft competitive adaptation: neural network learning algorithms based on fitting statistical mixtures , 1991 .

[61]  Anders Krogh,et al.  Introduction to the theory of neural computation , 1994, The advanced book program.

[62]  David J. Marchette,et al.  Adaptive mixtures: Recursive nonparametric pattern recognition , 1991, Pattern Recognit..

[63]  M. Kramer Nonlinear principal component analysis using autoassociative neural networks , 1991 .

[64]  J. Freidman,et al.  Multivariate adaptive regression splines , 1991 .

[65]  T. Leen Dynamics of learning in linear feature-discovery networks , 1991 .

[66]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[67]  Garrison W. Cottrell,et al.  Non-Linear Dimensionality Reduction , 1992, NIPS.

[68]  John E. Moody,et al.  Fast Pruning Using Principal Components , 1993, NIPS.

[69]  S. Hanson,et al.  Some Solutions to the Missing Feature Problem in Vision , 1993 .

[70]  Zoubin Ghahramani,et al.  Solving inverse problems using an EM approach to density estimation , 1993 .

[71]  Stephen M. Omohundro,et al.  Surface Learning with Applications to Lipreading , 1993, NIPS.

[72]  Michael I. Jordan,et al.  Supervised learning from incomplete data via an EM approach , 1993, NIPS.

[73]  Nanda Kambhatla,et al.  Fast Non-Linear Dimension Reduction , 1993, NIPS.

[74]  Volker Tresp,et al.  Training Neural Networks with Deficient Data , 1993, NIPS.

[75]  Robert A. Jacobs,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1993, Neural Computation.

[76]  Terrence J. Sejnowski,et al.  A Mixture Model System for Medical and Machine Diagnosis , 1994, NIPS.

[77]  Stephen M. Omohundro,et al.  Nonlinear Image Interpolation using Manifold Learning , 1994, NIPS.

[78]  Geoffrey E. Hinton,et al.  Recognizing Handwritten Digits Using Mixtures of Linear Models , 1994, NIPS.

[79]  Larry P. Heck,et al.  Gaussian mixture model classifiers for machine monitoring , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[80]  Michael I. Jordan,et al.  Learning from Incomplete Data , 1994 .

[81]  Ronald A. Cole,et al.  Towards automatic collection of the US census , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[82]  José L. Marroquín,et al.  Measure fields for function approximation , 1995, IEEE Trans. Neural Networks.

[83]  Juha Karhunen,et al.  Generalizations of principal component analysis, optimization problems, and neural networks , 1995, Neural Networks.

[84]  Volker Tresp,et al.  Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging , 1995, NIPS.