Optimized Linear Imputation

Real-world datasets, especially high-dimensional ones, often have missing feature values. Since most data analysis and statistical methods do not handle missing values gracefully, the first step in the analysis is usually to impute them. Indeed, there has been long-standing interest in methods for the imputation of missing values as a pre-processing step. One recent and effective approach, the IRMI stepwise regression imputation method, fits a linear regression model for each real-valued feature on the basis of all other features in the dataset. However, the proposed iterative formulation lacks a convergence guarantee. Here we propose a closely related method, stated as a single optimization problem, together with a block coordinate-descent solution that is guaranteed to converge to a local minimum. Experiments on both synthetic and benchmark datasets show results comparable to those of the IRMI method whenever it converges. In the set of experiments described here, however, IRMI often does not converge, while the performance of our methods is markedly superior to that of the other methods compared.
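To make the per-feature regression idea concrete, the following is a minimal sketch of generic iterative linear-regression imputation. It is neither IRMI itself (there is no stepwise selection or robust regression) nor the optimization problem proposed here, and it carries no convergence guarantee. The function name `iterative_linear_impute`, the mean initialization, and the stopping rule are assumptions made for illustration; missing values are assumed to be encoded as NaN, and every column is assumed to have at least one observed value.

```python
import numpy as np

def iterative_linear_impute(X, n_iters=20, tol=1e-6):
    """Hypothetical helper: fill NaNs by regressing each feature on all others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumed initialization: replace each missing entry by its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    n, d = filled.shape
    for _ in range(n_iters):
        prev = filled.copy()
        for j in range(d):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this feature
            # Design matrix: all other features plus an intercept column.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([others, np.ones(n)])
            # Ordinary least squares on rows where feature j is observed.
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Overwrite only the missing entries with the predictions.
            filled[rows, j] = design[rows] @ coef
        # Assumed stopping rule: small relative change between sweeps.
        if np.linalg.norm(filled - prev) <= tol * max(1.0, np.linalg.norm(prev)):
            break
    return filled
```

Observed entries are never modified; only the entries that were missing are updated in each sweep. Because each feature is refit in turn without a shared objective, a loop of this kind may oscillate on some data, which is the kind of behaviour a single optimization problem is meant to avoid.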
