K-Means Clustering With Incomplete Data

Clustering has been intensively studied in the machine learning and data mining communities. Although they perform well in many applications, most existing clustering algorithms cannot efficiently handle data with incomplete features, which are common in practice. To address this issue, we propose a novel K-means-based clustering algorithm that unifies clustering and imputation in a single objective function. This allows the two processes to negotiate with each other to reach optimality. Furthermore, we design an alternating optimization algorithm to solve the resulting optimization problem and theoretically prove its convergence. Comprehensive experiments on nine UCI benchmark datasets and real-world applications demonstrate the effectiveness of the proposed algorithm, which outperforms several commonly used methods for incomplete-data clustering.
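The idea of optimizing clustering and imputation under one objective can be illustrated with a minimal sketch. The abstract does not state the exact objective or update rules, so everything below is an assumption made for illustration: the objective is taken to be the usual K-means sum of squared distances with the missing entries treated as free variables, and the function name `kmeans_impute` and its parameters are hypothetical. Under that assumption, each of the three alternating steps can only lower or keep the objective, which is the kind of monotonicity a convergence argument typically relies on.

```python
import numpy as np

def kmeans_impute(X, k, n_iter=100, seed=0):
    """Illustrative sketch only (not the paper's algorithm).

    Alternately minimizes sum_i ||x_i - c_{z_i}||^2 over assignments z,
    centers c, and the missing entries of X (marked as NaN).
    Assumes every column has at least one observed value.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: fill missing entries with the observed column means.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Initialize centers from k distinct rows of the filled data.
    centers = filled[rng.choice(len(filled), size=k, replace=False)].copy()
    labels = np.full(len(filled), -1)
    history = []  # objective value at the end of each iteration

    for _ in range(n_iter):
        # Step 1: assign each sample to its nearest center.
        dists = ((filled[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 2: move each non-empty cluster's center to its members' mean.
        for j in range(k):
            members = filled[new_labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        # Step 3: re-impute missing entries from the assigned center;
        # observed entries are always taken from X unchanged.
        filled = np.where(missing, centers[new_labels], X)

        history.append(((filled - centers[new_labels]) ** 2).sum())
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels

    return labels, centers, filled, history
```

Calling `kmeans_impute(X, k=3)` on an array whose missing entries are `np.nan` returns the cluster labels, the centers, the completed data, and the per-iteration objective values. Filling a missing entry with the corresponding coordinate of the assigned center is the minimizer of this assumed objective for that entry, which is why imputation and clustering can share one loop here.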
