Bayesian Ensemble Learning for Nonlinear Factor Analysis

An active research topic in machine learning is the development of model structures that are rich enough to represent the relevant aspects of the observations while still allowing efficient learning and inference. Linear factor analysis and related methods, such as principal component analysis and independent component analysis, are widely used feature extraction and data analysis techniques. They are computationally efficient but restricted to linear models. Many natural phenomena are nonlinear, and therefore several attempts have been made to generalise these models by relaxing the linearity assumption. The suggested approaches have suffered from overfitting, and the computational complexity of many of the algorithms scales exponentially with the number of factors, which makes applying them to high-dimensional factor spaces infeasible. This thesis describes the development of a nonlinear extension of factor analysis. The learning algorithm is based on Bayesian probability theory and solves many of the problems related to overfitting. The unknown nonlinear generative mapping is modelled by a multi-layer perceptron network. The computational complexity of the algorithm scales quadratically with the dimension of the factor space, which makes it possible to use significantly more factors than with previous algorithms. The feasibility of the algorithm is demonstrated in experiments with artificial and natural data sets. Extensions that combine the nonlinear model with non-Gaussian and dynamic models for the factors are also introduced.
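
To make the generative model described above concrete, the sketch below draws observations from a nonlinear factor model whose mapping from factors to observations is a multi-layer perceptron. This is a minimal illustration under assumed choices, not the thesis implementation: the single tanh hidden layer, the Gaussian factor and noise models, the dimensions, the noise level, and the names A, a, B, b and f are all illustrative assumptions.

    # Minimal sketch (assumptions noted above): observations x are generated
    # from lower-dimensional factors s through an MLP mapping f, plus noise.
    import numpy as np

    rng = np.random.default_rng(0)

    n_factors, n_hidden, n_obs = 4, 20, 10        # assumed, illustrative sizes
    A = rng.normal(size=(n_hidden, n_factors))    # hidden-layer weights
    a = rng.normal(size=n_hidden)                 # hidden-layer biases
    B = rng.normal(size=(n_obs, n_hidden))        # output-layer weights
    b = rng.normal(size=n_obs)                    # output-layer biases

    def f(s):
        # Nonlinear generative mapping f(s) = B tanh(A s + a) + b
        return B @ np.tanh(A @ s + a) + b

    # One observed vector: Gaussian factors passed through f plus observation noise.
    s = rng.normal(size=n_factors)
    x = f(s) + 0.1 * rng.normal(size=n_obs)

In a Bayesian ensemble (variational) treatment such as the one developed in the thesis, the weights, biases, factors and noise parameters would all be treated as unknowns with approximate posterior distributions rather than the fixed point values used here; the sketch only shows the forward generative direction.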
