Overlapping correlation clustering

We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping such that the given similarities between objects agree as closely as possible with the similarities computed over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0–1 function indicating whether the two label sets have a nonempty intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. Each definition of label-set similarity gives rise to a non-trivial optimization problem: for the set-intersection measure we solve it with a greedy strategy, and for the Jaccard measure with non-negative least squares. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm takes as input only pairwise similarities of objects and can thus be applied to clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms to three distinct and complex application domains: trajectories, amino-acid sequences, and textual documents.
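To make the objective concrete, the following is a minimal sketch of the two label-set similarity measures and of a disagreement cost over all object pairs. The abstract only states that the given similarities should agree as much as possible with the label-set similarities, so measuring disagreement as a sum of absolute differences is an assumption here, as are the function names and the convention that two empty label sets have Jaccard similarity 0.

```python
from itertools import combinations


def intersection_indicator(a, b):
    """0-1 measure: 1.0 if the two label sets share at least one label."""
    return 1.0 if a & b else 0.0


def jaccard(a, b):
    """Jaccard coefficient between two label sets."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty label sets
    return len(a & b) / len(union)


def disagreement_cost(similarity, labels, label_sim):
    """Sum over unordered pairs of |label_sim(labels[u], labels[v]) - similarity[u][v]|.

    `similarity` is a symmetric matrix (list of lists) with values in [0, 1];
    `labels` is a list of label sets, one per object.
    The absolute-difference form of the cost is an assumption of this sketch.
    """
    n = len(labels)
    return sum(
        abs(label_sim(labels[u], labels[v]) - similarity[u][v])
        for u, v in combinations(range(n), 2)
    )


# Object 1 is similar to both 0 and 2, but 0 and 2 are dissimilar.
S = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
overlapping = [{0}, {0, 1}, {1}]   # object 1 belongs to both clusters
partition = [{0}, {0}, {1}]        # ordinary non-overlapping clustering

cost_overlap_inter = disagreement_cost(S, overlapping, intersection_indicator)
cost_partition_inter = disagreement_cost(S, partition, intersection_indicator)
cost_overlap_jaccard = disagreement_cost(S, overlapping, jaccard)
```

In this toy instance, letting object 1 carry two labels removes every disagreement under the set-intersection measure, whereas any single-label assignment must violate at least one pair. A local-search method would repeatedly re-optimize one object's label set while holding the others fixed, using a cost of this kind to compare candidates.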
