Learning from Incomplete Data

Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives: the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977), both for the estimation of mixture components and for coping with the missing data.
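
To make the two uses of EM concrete, the sketch below fits a Gaussian mixture to data whose missing entries are marked with NaN, under a missing-at-random assumption. It is a minimal NumPy illustration, not the authors' code: the E-step computes responsibilities from each point's observed coordinates and, for every component, the conditional mean and covariance of the missing coordinates; the M-step then re-estimates the mixing proportions, means, and covariances from these filled-in sufficient statistics. The function name em_gmm_missing and every implementation detail are assumptions made for illustration.

import numpy as np

def em_gmm_missing(X, K, n_iter=50, reg=1e-6, seed=0):
    """EM for a Gaussian mixture fit to data with missing entries (NaN).

    Assumes features are missing at random and that every row has at least
    one observed feature.  The E-step computes responsibilities from the
    observed coordinates and fills in the missing coordinates with their
    conditional means (keeping the conditional covariance) so that the
    M-step updates use complete-data sufficient statistics.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = ~np.isnan(X)

    # Crude initialisation: pick K rows (holes filled with column means),
    # identity covariances, uniform mixing proportions.
    X_filled = np.where(obs, X, np.nanmean(X, axis=0))
    mu = X_filled[rng.choice(n, K, replace=False)]
    Sigma = np.stack([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # ---- E-step ----------------------------------------------------
        log_r = np.zeros((n, K))
        x_hat = np.zeros((n, K, d))        # E[x_i | z_i = k, observed part]
        C_hat = np.zeros((n, K, d, d))     # Cov[x_i | z_i = k, observed part]
        for i in range(n):
            o, m = obs[i], ~obs[i]
            xo = X[i, o]
            for k in range(K):
                Soo = Sigma[k][np.ix_(o, o)] + reg * np.eye(o.sum())
                diff = xo - mu[k, o]
                sol = np.linalg.solve(Soo, diff)
                _, logdet = np.linalg.slogdet(Soo)
                # log of pi_k times the Gaussian density of the observed block
                log_r[i, k] = (np.log(pi[k])
                               - 0.5 * (o.sum() * np.log(2 * np.pi)
                                        + logdet + diff @ sol))
                x_hat[i, k, o] = xo
                if m.any():
                    # Regression of the missing block on the observed block.
                    Smo = Sigma[k][np.ix_(m, o)]
                    x_hat[i, k, m] = mu[k, m] + Smo @ sol
                    C_hat[i, k][np.ix_(m, m)] = (
                        Sigma[k][np.ix_(m, m)]
                        - Smo @ np.linalg.solve(Soo, Smo.T))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # ---- M-step ----------------------------------------------------
        Nk = r.sum(axis=0) + 1e-12
        pi = Nk / n
        for k in range(K):
            mu[k] = (r[:, k, None] * x_hat[:, k]).sum(axis=0) / Nk[k]
            diffs = x_hat[:, k] - mu[k]
            Sigma[k] = ((r[:, k, None, None]
                         * (diffs[:, :, None] * diffs[:, None, :] + C_hat[:, k]))
                        .sum(axis=0) / Nk[k]) + reg * np.eye(d)
    return pi, mu, Sigma, r

# Toy usage: two well-separated 2-D clusters; delete one coordinate from
# a random 40% of the rows so every row keeps at least one observed value.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (200, 2))])
miss_rows = rng.random(X.shape[0]) < 0.4
X[miss_rows, rng.integers(0, 2, miss_rows.sum())] = np.nan
pi_hat, mu_hat, Sigma_hat, resp = em_gmm_missing(X, K=2)

In this sketch, a missing feature can afterwards be imputed by averaging the per-component conditional means in x_hat weighted by the responsibilities r; the same conditional-expectation machinery yields regression estimates when some coordinates are treated as targets, which is the sense in which mixture density estimation supports function approximation from incomplete data.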

[1]  N. Metropolis, et al.  Equation of State Calculations by Fast Computing Machines, 1953, Journal of Chemical Physics.

[2]  W. K. Hastings.  Monte Carlo Sampling Methods Using Markov Chains and Their Applications, 1970.

[3]  Donald B. Rubin, et al.  Maximum Likelihood from Incomplete Data, 1972.

[4]  G. C. Tiao, et al.  Bayesian inference in statistical analysis, 1973.

[5]  Richard O. Duda and Peter E. Hart.  Pattern classification and scene analysis, 1974, Wiley-Interscience.

[6]  A. P. Dempster, N. M. Laird, and D. B. Rubin.  Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977, Journal of the Royal Statistical Society, Series B.

[7]  Donald Geman, et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  R. Little, et al.  Maximum likelihood estimation for mixed continuous and categorical data with missing values, 1985.

[9]  Geoffrey E. Hinton, et al.  Learning and relearning in Boltzmann machines, 1986.

[10]  W. Wong, et al.  The calculation of posterior distributions by data augmentation, 1987.

[11]  James Kelly, et al.  AutoClass: A Bayesian Classification System, 1993, ML.

[12]  John Moody, et al.  Fast Learning in Networks of Locally-Tuned Processing Units, 1989, Neural Computation.

[13]  David J. Hand, et al.  Mixture Models: Inference and Applications to Clustering, 1989.

[14]  Halbert White.  Learning in Artificial Neural Networks: A Statistical Perspective, 1989, Neural Computation.

[15]  J. Ross Quinlan.  Unknown Attribute Values in Induction, 1989, ML.

[16]  Tomaso A. Poggio, et al.  Extensions of a Theory of Networks for Approximation and Learning, 1990, NIPS.

[17]  Donald F. Specht.  A general regression neural network, 1991, IEEE Transactions on Neural Networks.

[18]  Steven J. Nowlan.  Soft competitive adaptation: neural network learning algorithms based on fitting statistical mixtures, 1991.

[19]  J. H. Friedman.  Multivariate adaptive regression splines, 1991.

[20]  Wray L. Buntine, et al.  Bayesian Back-Propagation, 1991, Complex Systems.

[21]  Zoubin Ghahramani, et al.  Solving inverse problems using an EM approach to density estimation, 1993.

[22]  Volker Tresp, et al.  Training Neural Networks with Deficient Data, 1993, NIPS.

[23]  Robert A. Jacobs, et al.  Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.

[24]  C. M. Bishop.  Mixture Density Networks, 1994.

[25]  John L. Casti, et al.  The Theory of Networks, 1995.

[26]  R. Tibshirani, et al.  Discriminant Analysis by Gaussian Mixtures, 1996.