k-POD: A Method for k-Means Clustering of Missing Data

The k-means algorithm is often used in clustering applications, but it requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the problem to a complete-data formulation through either deletion or imputation, but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data. [Received November 2014. Revised August 2015.]
