Advanced methods for missing values imputation based on similarity learning

Real-world data analysis and processing with data mining techniques often encounter observations that contain missing values, and these missing values are a major challenge when mining datasets. Missing values should be imputed with a suitable imputation method to improve the accuracy and performance of data mining methods. Some existing techniques use the k-nearest neighbors algorithm (kNN) to impute missing values, but determining an appropriate k value can be challenging. Other existing imputation techniques are based on hard clustering algorithms. When records are not well separated, as is the case with missing data, hard clustering often describes the data poorly. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two imputation methods for numerical missing data. The first, called KI, is a hybrid method that combines kNN with iterative imputation. The best set of nearest neighbors for each incomplete record is found with kNN according to record similarity. To improve the similarity, a suitable k value is estimated automatically. Iterative imputation then imputes the missing values of the incomplete record using the global correlation structure among the selected records. The second, called FCKI, is an enhanced hybrid method that extends KI. It integrates fuzzy c-means, kNN, and iterative imputation to impute the missing data in a dataset. Fuzzy c-means is chosen because it allows a record to belong to multiple clusters at the same time, which can further improve similarity.
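To illustrate why soft membership can help, the following minimal Python sketch computes the standard fuzzy c-means membership of a single record in each cluster, given fixed cluster centers. The function name fcm_memberships, the fuzzifier m = 2, and the toy centers are assumptions made for illustration; the sketch is not the FCKI implementation.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    Hypothetical helper: centers are taken as given, whereas a full
    fuzzy c-means run would alternate this step with center updates.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [d == 0.0 for d in dists]
    if any(on_center):
        n = sum(on_center)
        return [1.0 / n if hit else 0.0 for hit in on_center]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Toy centers, assumed for illustration only.
centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fcm_memberships((1.0, 0.0), centers)
midway = fcm_memberships((2.0, 0.0), centers)
```

In this toy setting, a record midway between the two centers receives equal membership in both clusters, a tie that a hard clustering would have to break arbitrarily.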
FCKI searches a cluster, rather than the whole dataset, to find the best k nearest neighbors. It applies two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets of different sizes with varying missing ratios for three types of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These missing data types are generated in this work. The proposed techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than the other missing data imputation methods.
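The evaluation protocol described above can be sketched in a few lines of Python: hide a fraction of known cells completely at random (MCAR), impute them, and score the imputed values against the hidden ground truth. All names here (mcar_mask, column_mean_impute, the toy data) are hypothetical. The column-mean imputer is only a placeholder baseline, not KI or FCKI, and the NRMSE below divides by the range of the true values, which is one common convention; the abstract does not state which normalization the paper uses.

```python
import math
import random

def mcar_mask(n_rows, n_cols, ratio, seed=0):
    """Choose cells to hide so that every cell is equally likely (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, round(ratio * len(cells))))

def rmse(true, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def nrmse(true, pred):
    # Assumption: normalize RMSE by the range of the true values.
    value_range = max(true) - min(true)
    if value_range == 0:
        raise ValueError("NRMSE is undefined when all true values are equal")
    return rmse(true, pred) / value_range

def column_mean_impute(data, hidden):
    """Placeholder baseline: fill each hidden cell with its column's visible mean."""
    means = []
    for j in range(len(data[0])):
        visible = [row[j] for i, row in enumerate(data) if (i, j) not in hidden]
        means.append(sum(visible) / len(visible))
    return {(i, j): means[j] for (i, j) in hidden}

# Toy complete dataset (assumed); 25% of its cells are hidden.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
hidden = mcar_mask(len(data), len(data[0]), ratio=0.25)
imputed = column_mean_impute(data, hidden)

# Score only the cells that were hidden, since those are the imputed ones.
cells = sorted(hidden)
true_vals = [data[i][j] for i, j in cells]
pred_vals = [imputed[c] for c in cells]
scores = {
    "RMSE": rmse(true_vals, pred_vals),
    "NRMSE": nrmse(true_vals, pred_vals),
    "MAE": mae(true_vals, pred_vals),
}
```

Replacing column_mean_impute with another imputer leaves the masking and scoring unchanged, so different methods can be compared on identical hidden cells.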
