A New Conceptual Clustering Framework

We propose a new formulation of the conceptual clustering problem in which the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches in that the discovered clusters may overlap and need not cover all the points. In addition, a point may be assigned to a cluster description even if it satisfies only most, and not necessarily all, of the attributes in the conjunction. We draw connections between this conceptual clustering problem and the maximum edge biclique problem. We give simple randomized algorithms that discover a collection of approximate conjunctive cluster descriptions in sublinear time.
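To make the notions in the abstract concrete, the following is a minimal illustrative sketch and not the algorithms of the paper. It assumes points are represented as sets of attributes, and it assumes one possible reading of "most": a point matches a conjunction when it has at least a (1 - eps) fraction of its attributes. The names `satisfies`, `cluster`, `estimate_coverage`, and the parameter `eps` are hypothetical and chosen only for this example.

```python
import math
import random


def satisfies(point, conjunction, eps=0.0):
    """Return True if `point` has at least a (1 - eps) fraction of the
    attributes in `conjunction` (eps = 0 means exact satisfaction).
    This threshold rule is an assumption made for illustration."""
    needed = math.ceil((1 - eps) * len(conjunction))
    return len(point & conjunction) >= needed


def cluster(points, conjunction, eps=0.0):
    """Indices of all points assigned to the cluster described by `conjunction`."""
    return [i for i, p in enumerate(points) if satisfies(p, conjunction, eps)]


def estimate_coverage(points, conjunction, eps, sample_size, rng):
    """Estimate the fraction of points in the cluster from a uniform random
    sample (with replacement). This only illustrates the sampling idea behind
    sublinear-time methods; it is not the paper's procedure."""
    sample = [rng.choice(points) for _ in range(sample_size)]
    return sum(satisfies(p, conjunction, eps) for p in sample) / sample_size


# Toy data: each point is the set of attributes it has.
points = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c", "d"}),
    frozenset({"d"}),
    frozenset({"e"}),
]
conj1 = frozenset({"a", "b", "c"})
conj2 = frozenset({"d"})

exact1 = cluster(points, conj1)             # points with all of a, b, c
approx1 = cluster(points, conj1, eps=0.34)  # points with at least 2 of a, b, c
exact2 = cluster(points, conj2)             # points with attribute d

# With exact satisfaction, the cluster's points and the conjunction's attributes
# form a biclique in the point-attribute bipartite graph; its edge count is:
biclique_edges = len(exact1) * len(conj1)

covered = set(approx1) | set(exact2)
uncovered = [i for i in range(len(points)) if i not in covered]
estimate = estimate_coverage(points, conj1, 0.34, 50, random.Random(0))
```

In this toy data, point 2 belongs to both clusters (overlap) and point 4 belongs to neither (incomplete coverage), which are the two properties that distinguish the formulation from partition-based clustering. The product of cluster size and conjunction length is the quantity a maximum edge biclique would maximize, which suggests why the two problems are related.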
