Paper Information - k-POD: A Method for k-Means Clustering of Missing Data

The k-means algorithm is often used in clustering applications, but it requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation, but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data. [Received November 2014. Revised August 2015.]

Richard G. Baraniuk | Eric C. Chi | Jocelyn T. Chi

[1] Sanjoy Dasgupta,et al. Random projection trees for vector quantization , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.

[2] Sergei Vassilvitskii,et al. k-means++: the advantages of careful seeding , 2007, SODA '07.

[3] J. MacQueen. Some methods for classification and analysis of multivariate observations , 1967 .

[4] 鳥居泰彦,et al. 世界経済・社会統計 = World development indicators , 1998 .

[5] J. V. Ryzin,et al. Clustering Algorithms; Cluster Analysis Algorithms; Classification and Clustering , 1981 .

[6] R Core Team,et al. R: A language and environment for statistical computing. , 2014 .

[7] Pierre Hansen,et al. NP-hardness of Euclidean sum-of-squares clustering , 2008, Machine Learning.

[8] Roderick J. A. Little,et al. Statistical Analysis with Missing Data , 2002 .

[9] G. Kalton,et al. Handling missing data in survey research , 1996, Statistical methods in medical research.

[10] Xiaogang Wang,et al. CLUES: A non-parametric clustering method based on local shrinking , 2007, Comput. Stat. Data Anal..

[11] Robert Tibshirani,et al. Spectral Regularization Algorithms for Learning Large Incomplete Matrices , 2010, J. Mach. Learn. Res..

[12] Yehuda Koren,et al. All Together Now: A Perspective on the Netflix Prize , 2010 .

[13] William M. Rand,et al. Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[14] Emmanuel J. Candès,et al. Exact Matrix Completion via Convex Optimization , 2008, Found. Comput. Math..

[15] E. Forgy,et al. Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[16] Ali S. Hadi,et al. Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[17] D. Hunter,et al. Optimization Transfer Using Surrogate Objective Functions , 2000 .

[18] Xiao-Li Meng,et al. [Optimization Transfer Using Surrogate Objective Functions]: Discussion , 2000 .

[19] D. Rubin. Multiple Imputation After 18+ Years , 1996 .

[20] J. William Ahwood,et al. CLASSIFICATION , 1931, Foundations of Familiar Language.

[21] Hsiu J. Ho,et al. On fast supervised learning for normal mixture models with missing information , 2006, Pattern Recognit..

[22] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[23] J. Glasby. All together now... , 2003, Nature Reviews Microbiology.

[24] Lynette A. Hunt,et al. Mixture model clustering for mixed data with missing information , 2003, Comput. Stat. Data Anal..

[25] D. Rubin,et al. Statistical Analysis with Missing Data. , 1989 .

[26] K. Wagstaff,et al. Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy , 2004 .

[27] Vincent Kanade,et al. Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[28] A. Gelman,et al. Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box , 2011 .

[29] Boris Mirkin,et al. Mathematical Classification and Clustering , 1996 .

[30] Michael I. Jordan,et al. Supervised learning from incomplete data via an EM approach , 1993, NIPS.

[31] John K. Dixon,et al. Pattern Recognition with Partly Missing Data , 1979, IEEE Transactions on Systems, Man, and Cybernetics.

[32] Michael I. Jordan,et al. Revisiting k-means: New Algorithms via Bayesian Nonparametrics , 2011, ICML.

[33] J. A. Hartigan,et al. A k-means clustering algorithm , 1979 .

[34] K. Lange,et al. EM algorithms without missing data , 1997, Statistical methods in medical research.

[35] Lena Osterhagen,et al. Multiple Imputation For Nonresponse In Surveys , 2016 .

[36] Emmanuel J. Candès,et al. A Singular Value Thresholding Algorithm for Matrix Completion , 2008, SIAM J. Optim..

[37] Gary King,et al. Amelia II: A Program for Missing Data , 2011 .

[38] D. Rubin,et al. Multiple Imputation for Nonresponse in Surveys , 1989 .

[39] Stef van Buuren,et al. MICE: Multivariate Imputation by Chained Equations in R , 2011 .