Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results

Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to not only improve the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative style algorithms have running times that are quadratic (or faster growing) functions of the number of instances to be clustered. We present several bounds on the improvement in the running times of algorithms obtainable using constraints.

[1]  George Karypis,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005, Data Mining and Knowledge Discovery.

[2]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[3]  Ian Davidson,et al.  Constrained Clustering: Advances in Algorithms, Theory, and Applications , 2008 .

[4]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[5]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[6]  S. S. Ravi,et al.  Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results , 2005, PKDD.

[7]  Arindam Banerjee,et al.  Active Semi-Supervision for Pairwise Constrained Clustering , 2004, SDM.

[8]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[9]  Harald Niederreiter,et al.  Probability and computing: randomized algorithms and probabilistic analysis , 2006, Math. Comput..

[10]  Charles Elkan,et al.  Using the Triangle Inequality to Accelerate k-Means , 2003, ICML.

[11]  Thomas J. Schaefer,et al.  The complexity of satisfiability problems , 1978, STOC.

[12]  S. S. Ravi,et al.  Intractability and clustering with constraints , 2007, ICML '07.

[13]  Mirco Nanni,et al.  Speeding-Up Hierarchical Agglomerative Clustering in Presence of Expensive Metrics , 2005, PAKDD.

[14]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[15]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[16]  James Bailey,et al.  COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity , 2006, Sixth International Conference on Data Mining (ICDM'06).

[17]  T. Postelnicu,et al.  A “Natural” Agglomerative Clustering Method for Biology , 1991 .

[18]  Stan Matwin,et al.  Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases , 2007 .

[19]  S. S. Ravi,et al.  The complexity of non-hierarchical clustering with instance and cluster level constraints , 2007, Data Mining and Knowledge Discovery.

[20]  S. S. Ravi,et al.  Clustering with Constraints: Feasibility Issues and the k-Means Algorithm , 2005, SDM.

[21]  S. S. Ravi,et al.  Efficient incremental constrained clustering , 2007, KDD '07.

[22]  S. S. Ravi,et al.  Identifying and Generating Easy Sets of Constraints for Clustering , 2006, AAAI.

[23]  Ronald L. Graham,et al.  Concrete Mathematics, a Foundation for Computer Science , 1991, The Mathematical Gazette.

[24]  Claire Cardie,et al.  Clustering with Instance-Level Constraints , 2000, AAAI/IAAI.

[25]  Scott T. Acton,et al.  Agglomerative clustering for image segmentation , 2002, Object recognition supported by user interaction for service robots.

[26]  Ian Davidson,et al.  Measuring Constraint-Set Utility for Partitional Clustering Algorithms , 2006, PKDD.