On Clustering Incomplete Data

We study fundamental clustering problems for incomplete data. In this setting, we are given a set of incomplete d-dimensional Boolean vectors (representing the rows of a matrix), and the goal is to complete the missing vector entries so that the set of complete vectors admits a partitioning into at most k clusters with radius or diameter at most r. We develop a toolkit and use it to give tight characterizations of the parameterized complexity of these problems with respect to the parameters k, r, and the minimum number of rows and columns needed to cover all the missing entries. We show that the aforementioned problems are fixed-parameter tractable when parameterized by the three parameters combined, and that dropping any of these three parameters results in parameterized intractability. We extend this toolkit to settle the parameterized complexity of other clustering problems, answering an open question along the way. We also show how our results can be extended to data over any constant-size domain. A byproduct of our results is that, for the complete data setting, all problems under consideration are fixed-parameter tractable parameterized by k+r.
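
To make the problem statement concrete, here is a minimal brute-force sketch of the diameter variant, assuming Hamming distance between Boolean vectors and using `None` to mark a missing entry. The function names (`hamming`, `completions`, `diameter_clusterable`, `has_diameter_completion`) are hypothetical and chosen only for illustration. The sketch enumerates every completion and every assignment to clusters, so it runs in exponential time; it illustrates the definition only and is not the fixed-parameter algorithm developed in the paper.

```python
from itertools import product


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def completions(rows):
    """Yield every way of filling the None entries of rows with 0 or 1."""
    holes = [(i, j) for i, row in enumerate(rows)
             for j, x in enumerate(row) if x is None]
    for bits in product((0, 1), repeat=len(holes)):
        filled = [list(row) for row in rows]
        for (i, j), b in zip(holes, bits):
            filled[i][j] = b
        yield [tuple(row) for row in filled]


def diameter_clusterable(vectors, k, r):
    """True if vectors split into at most k clusters, each of diameter <= r.

    Diameter here means the largest pairwise Hamming distance in a cluster.
    Plain backtracking: put each vector into a compatible existing cluster,
    or open a new cluster while fewer than k are in use.
    """
    clusters = []

    def place(idx):
        if idx == len(vectors):
            return True
        v = vectors[idx]
        for c in clusters:
            if all(hamming(v, u) <= r for u in c):
                c.append(v)
                if place(idx + 1):
                    return True
                c.pop()
        if len(clusters) < k:
            clusters.append([v])
            if place(idx + 1):
                return True
            clusters.pop()
        return False

    return place(0)


def has_diameter_completion(rows, k, r):
    """True if some completion of rows admits the required clustering."""
    return any(diameter_clusterable(c, k, r) for c in completions(rows))


# Example input: four 3-dimensional rows, two of which have a missing entry.
rows = [(0, 0, None), (0, 0, 1), (1, 1, None), (1, 1, 0)]
```

With these rows, `has_diameter_completion(rows, k=2, r=0)` holds, because filling the first row with 1 and the third with 0 gives two pairs of identical vectors. With `k=1, r=1` no completion works, since the two complete rows already differ in all three coordinates. The radius variant would instead ask for a center vector within distance r of every member of a cluster and is not covered by this sketch.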
