Ensemble active imputation for incomplete data

Real data is often incomplete, which hinders its usability and learnability. A realistic machine learning scenario is one in which some missing values and labels can be obtained on request, at a cost. In this paper, we propose a new ensemble active missing imputation (EAMI) algorithm to handle this learning task. First, we design five missing imputation methods: mean filling, cubic spline interpolation filling, sample-based collaborative filtering weighted filling, attribute-based collaborative filtering weighted filling, and k-nearest neighbor (KNN) filling. Second, we propose an ensemble imputation model that linearly weights the attribute values predicted by these methods. Third, we propose a three-way decisions model that uses the variance of the predicted values to decide whether a missing value is filled by querying the true label or by using the predicted values. We conduct experiments on University of California Irvine (UCI) datasets. The results of significance tests verify the effectiveness of EAMI and its superiority over KNN missing data imputation algorithms.
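
The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only two of the five imputers (mean and KNN filling), and the function names, the weights, the thresholds `alpha` and `beta`, the `oracle` callback, and the handling of the middle "defer" region are all assumptions made for illustration. The abstract states only that the variance of the predicted values drives the choice between querying and using predictions.

```python
import numpy as np


def mean_fill(X, i, j):
    """Predict X[i, j] as the mean of the observed entries in column j."""
    return float(np.nanmean(X[:, j]))


def knn_fill(X, i, j, k=2):
    """Predict X[i, j] as the mean of column j over the k nearest rows.

    Distance is the root mean squared difference over attributes that
    both rows have observed (an assumed choice, not from the paper).
    """
    candidates = []
    for r in range(X.shape[0]):
        if r == i or np.isnan(X[r, j]):
            continue
        shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
        shared[j] = False
        if not shared.any():
            continue
        dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
        candidates.append((dist, X[r, j]))
    if not candidates:
        return mean_fill(X, i, j)  # fall back when no neighbor is usable
    candidates.sort(key=lambda t: t[0])
    return float(np.mean([v for _, v in candidates[:k]]))


def ensemble_predict(preds, weights):
    """Linearly weight the base predictions (weights are normalized)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.dot(weights, preds))


def three_way_decision(preds, alpha, beta):
    """Map the variance of the base predictions to one of three actions.

    Low variance: accept the prediction. High variance: query the true
    value. In between: defer (hypothetical handling of the boundary).
    """
    var = float(np.var(preds))
    if var <= alpha:
        return "accept"
    if var >= beta:
        return "query"
    return "defer"


def impute(X, weights, alpha, beta, oracle):
    """Fill each missing cell by prediction or by a paid query."""
    X_filled = X.copy()
    decisions = {}
    for i, j in zip(*np.where(np.isnan(X))):
        preds = np.array([mean_fill(X, i, j), knn_fill(X, i, j)])
        decision = three_way_decision(preds, alpha, beta)
        if decision == "accept":
            X_filled[i, j] = ensemble_predict(preds, weights)
        elif decision == "query":
            X_filled[i, j] = oracle(i, j)  # costly request for the true value
        decisions[(int(i), int(j))] = decision
    return X_filled, decisions


if __name__ == "__main__":
    X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 10.0], [1.0, 2.2]])
    filled, decisions = impute(X, [0.2, 0.8], alpha=2.0, beta=5.0,
                               oracle=lambda i, j: 0.0)
    print(decisions)
    print(filled)
```

In this sketch, disagreement among the base imputers acts as the uncertainty signal: when they agree, the weighted prediction is used at no cost, and a query is issued only when they diverge strongly. Cells in the deferred region are left missing here; how they are actually resolved is not specified in the abstract.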
