Non-redundant clustering with conditional ensembles

Data often contain multiple plausible clusterings. To discover a clustering that is useful to the user, constrained clustering techniques have been proposed to guide the search. Typically, these techniques assume background knowledge in the form of explicit information about the desired clustering. In contrast, we consider the setting in which the background knowledge concerns an undesired clustering; such knowledge may be obtained from an existing classification or from a previously applied clustering algorithm. The problem is then to find a novel, "orthogonal" clustering in the data. We present a general algorithmic framework that uses cluster ensemble methods to solve this problem. A key advantage of this approach is that it treats the base clustering method as a black box, allowing the practitioner to select the clustering method most appropriate for the domain. We present experimental results on synthetic and text data which establish the competitiveness of this framework.
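The abstract does not spell out the algorithm, but the general idea it describes — build an ensemble with a black-box base clusterer, then favor clusterings that are non-redundant with the known undesired one — can be illustrated with a rough sketch. The following is an assumption-laden toy version, not the paper's actual method: it runs a minimal k-means on random feature subsets, keeps the runs with the lowest normalized mutual information against the undesired labeling, and takes a co-association consensus over the survivors. All names, parameters, and design choices here are illustrative.

```python
# Illustrative sketch of "conditional ensemble" clustering (NOT the paper's
# exact algorithm): generate diverse base clusterings, filter out those
# redundant with a known undesired labeling, and combine the rest.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means with k-means++-style seeding; stands in for any
    black-box base clusterer."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Pick the next center with probability proportional to squared
        # distance from the nearest existing center (spreads centers apart).
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def nmi(a, b):
    """Normalized mutual information between two integer labelings."""
    eps = 1e-12
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1), joint.sum(0)
    mi = (joint * np.log(joint / (pa[:, None] * pb[None, :] + eps) + eps)).sum()
    ha = -(pa * np.log(pa + eps)).sum()
    hb = -(pb * np.log(pb + eps)).sum()
    return mi / max(np.sqrt(ha * hb), eps)

def conditional_ensemble(X, undesired, k=2, n_runs=30, keep=5, seed=0):
    """Keep the base runs least redundant w.r.t. `undesired` (lowest NMI),
    then form a consensus by clustering the co-association matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    runs = []
    for s in range(n_runs):
        feats = rng.choice(d, size=max(1, d // 2), replace=False)
        runs.append(kmeans(X[:, feats], k, seed=s))
    runs.sort(key=lambda lab: nmi(lab, undesired))
    kept = runs[:keep]
    # co[i, j] = fraction of kept runs placing points i and j together.
    co = sum((lab[:, None] == lab[None, :]).astype(float) for lab in kept) / keep
    return kmeans(co, k, seed=0)
```

On synthetic data with two orthogonal two-way structures (e.g. four Gaussian blobs at the corners of a square, with the left/right split given as the undesired clustering), this sketch recovers the top/bottom split rather than rediscovering left/right.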
