A flexible ILP formulation for hierarchical clustering

Abstract: Hierarchical clustering is a popular approach in a number of fields, with many well-known algorithms. However, to our knowledge, all existing work implements a greedy heuristic algorithm with no explicit objective function. In this work we formalize hierarchical clustering as an integer linear programming (ILP) problem with a natural objective function and the dendrogram properties enforced as linear constraints. Our experimental work shows that, even for small data sets, finding the global optimum produces more accurate results. Formalizing hierarchical clustering as an ILP with constraints has several advantages beyond finding the global optimum. Relaxing dendrogram constraints such as transitivity produces novel problem variations, such as finding hierarchies with overlapping clusterings. It is also possible to add constraints that encode guidance such as must-link, cannot-link, and must-link-before. Finally, although exact solvers exist for ILPs, we show that a simple randomized algorithm and a linear programming (LP) relaxation can provide approximate solutions faster.
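
To make the role of the transitivity constraint concrete, the following is a minimal sketch, not the paper's formulation. It assumes a hierarchy is encoded as a hypothetical "merge matrix" `M`, where `M[a][b]` is the level at which points `a` and `b` first share a cluster; the function name and the example matrices are illustrative assumptions. Under that encoding, a proper dendrogram requires `M[a][c] <= max(M[a][b], M[b][c])` for every triple, and a matrix that violates this describes overlapping clusters at some level.

```python
from itertools import permutations


def is_valid_dendrogram(M):
    """Check dendrogram properties on a merge matrix (assumed encoding).

    M[a][b] is the level at which points a and b first share a cluster.
    """
    n = len(M)
    for a in range(n):
        # Reflexivity: a point is in its own cluster from level 0.
        if M[a][a] != 0:
            return False
        # Symmetry: a joins b at the same level that b joins a.
        for b in range(n):
            if M[a][b] != M[b][a]:
                return False
    # Transitivity: if a,b are together and b,c are together at some
    # level, then a,c must be together at that level as well.
    for a, b, c in permutations(range(n), 3):
        if M[a][c] > max(M[a][b], M[b][c]):
            return False
    return True


# Points 0 and 1 merge at level 1; point 2 joins them at level 2.
valid = [[0, 1, 2],
         [1, 0, 2],
         [2, 2, 0]]

# At level 1 the clusters {0, 1} and {1, 2} share point 1 (overlap),
# so transitivity fails for the pair (0, 2).
overlapping = [[0, 1, 2],
               [1, 0, 1],
               [2, 1, 0]]

print(is_valid_dendrogram(valid))
print(is_valid_dendrogram(overlapping))
```

In an ILP, each of these checks would become a family of linear constraints over the decision variables; dropping the transitivity family is what would admit hierarchies like `overlapping`.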
