Fuzzy Clustering with Similarity Queries

The fuzzy (or soft) k-means objective is a popular generalization of the well-known k-means problem, extending the clustering capability of k-means to datasets that are uncertain, vague, or otherwise hard to cluster. In this paper, we propose a semisupervised active clustering framework in which the learner is allowed to interact with an oracle (a domain expert), asking for the similarity between a chosen set of items. We study the query and computational complexities of clustering in this framework. We prove that a few such similarity queries suffice to obtain a polynomial-time approximation algorithm for an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask O(poly(k) log n) similarity queries and run in polynomial time, where n is the number of items. The fuzzy k-means objective is nonconvex, with k-means as a special case, and is equivalent to other generic nonconvex problems such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or alternating-minimization algorithms) can get stuck at a local minimum. Our results show that with few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms on real-world datasets, showing their effectiveness in real-world applications.
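
For concreteness, the following is a minimal sketch of the textbook fuzzy k-means objective and of the Lloyd-type alternating minimization mentioned above. It is not the query-based algorithm of this paper. It assumes the standard formulation with squared Euclidean distances and a fuzzifier m > 1, which may differ from the paper's notation; the function names, the random initialization, and the small constant `eps` are illustrative choices.

```python
import numpy as np

def squared_distances(X, centers):
    # Pairwise squared Euclidean distances, shape (n, k).
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def fuzzy_objective(X, centers, U, m=2.0):
    # J(U, C) = sum_i sum_j U[i, j]^m * ||x_i - c_j||^2 (standard form, assumed here).
    return float(((U ** m) * squared_distances(X, centers)).sum())

def update_memberships(X, centers, m=2.0, eps=1e-12):
    # For fixed centers, the optimal membership of x_i in cluster j is
    # proportional to ||x_i - c_j||^(-2/(m-1)); each row sums to 1.
    # eps (an illustrative safeguard) avoids division by zero.
    d2 = squared_distances(X, centers) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def update_centers(X, U, m=2.0):
    # For fixed memberships, each center is a mean weighted by U[i, j]^m.
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]

def alternating_minimization(X, k, m=2.0, iters=50, seed=0):
    # Lloyd-type heuristic: may converge to a local minimum only.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        U = update_memberships(X, centers, m)
        centers = update_centers(X, U, m)
    U = update_memberships(X, centers, m)
    return centers, U
```

Each of the two updates can only decrease the objective, which is why the procedure converges, but the result depends on the initial centers. This dependence on initialization is the kind of failure mode the abstract refers to.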
