Order-Sensitive Imputation for Clustered Missing Values

Missing values (MVs) appear widely in real-world datasets and hinder the use of many statistical and machine learning algorithms for data analytics, because those algorithms cannot handle incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, which we call the Clustered Missing Values Phenomenon; the poor performance is attributable to the lack of sufficient complete tuples near an MV for imputation. In this paper, we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially so that values filled earlier in the process are also used to impute later MVs. The order of imputation is therefore critical to the effectiveness and efficiency of the OSICM framework. We formulate the search for the optimal imputation order as an optimization problem and show that it is NP-hard. Furthermore, we devise an algorithm that finds the exact optimal solution and propose two approximate/heuristic algorithms that trade effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.
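
The order sensitivity described above can be illustrated with a minimal sketch. It is not the OSICM algorithm itself: the abstract does not specify the underlying imputation method, so a plain k-nearest-neighbour mean is assumed here, and the function name `impute_in_order` and the toy data are hypothetical. The only property taken from the abstract is that each filled tuple joins the pool used for later imputations.

```python
import math

def impute_in_order(rows, order, k=2):
    """Fill None entries row by row; each filled row joins the donor pool.

    Assumption: kNN-mean imputation stands in for the (unspecified) base imputer.
    """
    data = [list(r) for r in rows]
    # Donors start as the rows that are complete in the input.
    donors = [i for i, r in enumerate(data) if None not in r]
    for i in order:
        observed = [j for j, v in enumerate(data[i]) if v is not None]
        missing = [j for j, v in enumerate(data[i]) if v is None]

        # Euclidean distance to a donor, using only the attributes row i has.
        def dist(d):
            return math.sqrt(sum((data[i][j] - data[d][j]) ** 2 for j in observed))

        nearest = sorted(donors, key=dist)[:k]
        for j in missing:
            data[i][j] = sum(data[d][j] for d in nearest) / len(nearest)
        # The freshly imputed row is now available to later imputations.
        donors.append(i)
    return data

# Toy data: two complete tuples and two incomplete tuples lying between them.
rows = [[0.0, 0.0], [4.0, None], [6.0, None], [10.0, 10.0]]
first = impute_in_order(rows, order=[1, 2])   # impute row 1, then row 2
second = impute_in_order(rows, order=[2, 1])  # impute row 2, then row 1
print(first[1][1], first[2][1])    # 5.0 7.5
print(second[1][1], second[2][1])  # 2.5 5.0
```

With the same data and the same base imputer, the two orders give different filled values, which is why the choice of order becomes an optimization problem in its own right.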
