Model-based overlapping clustering

While the vast majority of clustering algorithms are partitional, many real-world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from the analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the algorithm modifications this extension requires, and present results on synthetic data as well as on subsets of the 20-Newsgroups and EachMovie datasets.
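As a brief illustration of the correspondence the abstract relies on, the sketch below computes a Bregman divergence from a convex generator function. It is not the authors' algorithm; the function names (`bregman`, `squared_euclidean`, `kl_divergence`) are hypothetical and chosen only for this example. Choosing the generator as the squared norm gives squared Euclidean distance (the Gaussian case), while the negative entropy gives the KL divergence on probability vectors.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    g = grad_phi(y)
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))
    return phi(x) - phi(y) - inner

# Generator phi(x) = ||x||^2, which yields squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generator phi(x) = sum_i x_i log x_i (negative entropy), which yields
# the generalized I-divergence; on probability vectors this equals KL.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

def squared_euclidean(x, y):
    return bregman(sq_norm, sq_norm_grad, x, y)

def kl_divergence(p, q):
    # Assumes all entries of p and q are strictly positive.
    return bregman(neg_entropy, neg_entropy_grad, p, q)
```

Swapping the generator is the only change needed to move between divergences, which is the sense in which a model written for one exponential family member can be restated for another.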

[1] Paul J. Schweitzer, et al. Problem Decomposition and Data Reorganization by a Clustering Technique, 1972, Oper. Res.

[2] Sheldon M. Ross, et al. Introduction to probability models, 1975.

[3] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[4] P. McCullagh, et al. Generalized Linear Models, 1992.

[5] Sankar K. Pal, et al. Fuzzy models for pattern recognition, 1992.

[6] Ken Lang, et al. NewsWeeder: Learning to Filter Netnews, 1995, ICML.

[7] J. Navarro-Pedreño. Numerical Methods for Least Squares Problems, 1996.

[8] Eric Saund, et al. Applying the Multiple Cause Mixture Model to Text Categorization, 1996, ICML.

[9] Ramakrishnan Srikant, et al. Mining generalized association rules, 1995, Future Gener. Comput. Syst.

[10] Y. Censor, et al. Parallel Optimization: theory, 1997.

[11] Avi Pfeffer, et al. Probabilistic Frame-Based Systems, 1998, AAAI/IAAI.

[12] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.

[13] Lise Getoor, et al. Learning Probabilistic Relational Models, 1999, IJCAI.

[14] H. Sebastian Seung, et al. Algorithms for Non-negative Matrix Factorization, 2000, NIPS.

[15] George M. Church, et al. Biclustering of Expression Data, 2000, ISMB.

[16] Hans Kellerer, et al. The Multiple Subset Sum Problem, 2000, SIAM J. Optim.

[17] Andrzej Stachurski, et al. Parallel Optimization: Theory, Algorithms and Applications, 2000, Parallel Distributed Comput. Pract.

[18] Ash A. Alizadeh, et al. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns, 2000, Genome Biology.

[19] L. Lazzeroni. Plaid models for gene expression data, 2000.

[20] Yair Censor, et al. Proximity Function Minimization Using Multiple Bregman Projections, with Applications to Split Feasibility and Kullback–Leibler Distance Minimization, 2001, Ann. Oper. Res.

[21] Sanjoy Dasgupta, et al. A Generalization of Principal Components Analysis to the Exponential Family, 2001, NIPS.

[22] C. Papadimitriou, et al. On the value of private information, 2001.

[23] Eric R. Ziegel, et al. Generalized Linear Models, 2002, Technometrics.

[24] Geoffrey J. Gordon. Generalized^2 Linear^2 Models, 2002, NIPS 2002.

[25] Geoffrey J. Gordon. Generalized^2 Linear^2 Models, 2002, NIPS.

[26] Daphne Koller, et al. Decomposing Gene Expression into Cellular Processes, 2002, Pacific Symposium on Biocomputing.

[27] Inderjit S. Dhillon, et al. Information theoretic clustering of sparse cooccurrence data, 2003, Third IEEE International Conference on Data Mining.

[28] Raymond J. Mooney, et al. A probabilistic framework for semi-supervised clustering, 2004, KDD.

[29] Manfred K. Warmuth, et al. Relative Loss Bounds for Multidimensional Regression Problems, 1997, Machine Learning.

[30] Inderjit S. Dhillon, et al. Concept Decompositions for Large Sparse Text Data Using Clustering, 2004, Machine Learning.

[31] Vincent Conitzer, et al. Computing Shapley Values, Manipulating Value Division Schemes, and Checking Core Membership in Multi-Issue Domains, 2004, AAAI.

[32] Yoram Singer, et al. Logistic Regression, AdaBoost and Bregman Distances, 2000, Machine Learning.

[33] Arlindo L. Oliveira, et al. Biclustering algorithms for biological data analysis: a survey, 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[34] Daphne Koller, et al. Probabilistic discovery of overlapping cellular processes and their regulation, 2004, J. Comput. Biol.

[35] Avanidhar Subrahmanyam, et al. The Value of Private Information, 2005.

[36] David Pisinger, et al. Where are the hard knapsack problems?, 2005, Comput. Oper. Res.

[37] Inderjit S. Dhillon, et al. Clustering with Bregman Divergences, 2005, J. Mach. Learn. Res.

[38] Inderjit S. Dhillon, et al. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation, 2004, J. Mach. Learn. Res.