Nearest neighbour approach in the least-squares data imputation algorithms

Imputation of missing data is of interest in many areas, such as survey data editing, the maintenance of medical documentation, and DNA microarray data analysis. This paper presents an experimental analysis of a set of imputation methods developed within the so-called least-squares approximation approach, a non-parametric, computationally efficient multidimensional technique. First, we review global methods for least-squares data imputation. Then we propose extensions of these algorithms based on the nearest-neighbours approach. An experimental study of the algorithms on generated data sets shows that the straightforward algorithms may work rather well on data of simple structure and/or with a small number of missing entries. In more complex cases, however, the only winner within the least-squares approximation approach is INI, a method proposed in this paper that combines global and local imputation algorithms.

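To make the two families of methods concrete, here is a minimal Python sketch written for this summary, not the authors' code: a rank-one iterative least-squares fit stands in for the global methods, and a simple k-nearest-neighbour fill stands in for the local extension. The function names (global_ls_impute, nn_ls_impute) and parameter defaults are illustrative assumptions. The INI method described above would, roughly, chain the two: fill every gap with the global fit first, then re-estimate each missing entry from its nearest neighbours.

```python
import numpy as np

def global_ls_impute(X, n_iter=100, tol=1e-6):
    """Illustrative global imputation: fit a rank-one bilinear model to the
    observed entries by alternating (weighted) least squares, then fill the
    missing cells from the fitted approximation. A simplified stand-in for
    the global least-squares methods reviewed in the paper."""
    X = np.asarray(X, dtype=float)
    M = ~np.isnan(X)                    # True where the entry is observed
    Y = np.where(M, X, 0.0)             # observed data with NaNs zeroed out
    c = np.ones(X.shape[1])             # column loadings, initial guess
    for _ in range(n_iter):
        # row scores: weighted regression of each row on the loadings
        z = (Y @ c) / np.maximum(M @ (c * c), 1e-12)
        # loadings: weighted regression of each column on the scores
        c_new = (Y.T @ z) / np.maximum(M.T @ (z * z), 1e-12)
        if np.linalg.norm(c_new - c) < tol:
            c = c_new
            break
        c = c_new
    approx = np.outer(z, c)             # rank-one least-squares fit
    return np.where(M, X, approx)       # impute only the missing cells

def nn_ls_impute(X, k=5):
    """Illustrative local imputation: fill each missing entry from the k rows
    closest to its row on the columns observed in both."""
    X = np.asarray(X, dtype=float)
    M = ~np.isnan(X)
    filled = X.copy()
    for i in range(X.shape[0]):
        miss = ~M[i]
        if not miss.any():
            continue
        # mean squared distance to every other row over shared observed columns
        dists = np.full(X.shape[0], np.inf)
        for j in range(X.shape[0]):
            common = M[i] & M[j]
            if j != i and common.any():
                d = X[i, common] - X[j, common]
                dists[j] = np.mean(d * d)
        order = np.argsort(dists)
        neighbours = [j for j in order if np.isfinite(dists[j])][:k]
        for col in np.where(miss)[0]:
            vals = X[neighbours, col]
            vals = vals[~np.isnan(vals)]
            if vals.size:               # average of the neighbours' values
                filled[i, col] = vals.mean()
    return filled
```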