Hierarchical Clustering with Structural Constraints

Hierarchical clustering is a popular unsupervised data analysis method. In many real-world applications, we would like to exploit prior information about the data that imposes constraints on the clustering hierarchy but is not captured by the set of features available to the algorithm. This gives rise to the problem of "hierarchical clustering with structural constraints". Structural constraints pose major challenges for bottom-up approaches such as average and single linkage. Although they can be incorporated naturally into top-down divisive algorithms, no formal guarantees exist on the quality of the resulting output. In this paper, we provide provable approximation guarantees for two simple top-down algorithms, using a recently introduced optimization viewpoint of hierarchical clustering with pairwise similarity information [Dasgupta, 2016]. We show how to find good solutions even in the presence of conflicting prior information by formulating a constraint-based regularization of the objective. We further explore a variation of this objective for dissimilarity information [Cohen-Addad et al., 2018] and improve upon current techniques. Finally, we demonstrate our approach on a real dataset for a taxonomy application.
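To make the optimization viewpoint concrete, the following is a minimal sketch of the similarity-based cost from [Dasgupta, 2016]: each pair of points contributes its similarity weight multiplied by the number of leaves under the pair's lowest common ancestor, so a good hierarchy separates similar points as late as possible. The nested-tuple tree encoding, the function names, and the toy weights are assumptions made for illustration only; they are not taken from the paper, and the sketch does not include the paper's constraints or regularization.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaves under a tree given as nested tuples (assumed encoding)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, w):
    """Sum over pairs {i, j} of w[{i, j}] * (number of leaves under lca(i, j))."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(c) for c in tree]
    n_here = sum(len(ls) for ls in child_leaves)
    cost = 0.0
    # Pairs that fall into different children have this node as their lca.
    for a, b in combinations(child_leaves, 2):
        for i in a:
            for j in b:
                cost += w.get(frozenset((i, j)), 0.0) * n_here
    # Pairs that stay together are charged deeper in the tree.
    return cost + sum(dasgupta_cost(c, w) for c in tree)


# Hypothetical toy instance: "a" and "b" are very similar, "c" is an outlier.
w = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}
good = (("a", "b"), "c")  # splits off the outlier first
bad = (("a", "c"), "b")   # separates the similar pair at the root

cost_good = dasgupta_cost(good, w)
cost_bad = dasgupta_cost(bad, w)
```

On this toy instance the tree that keeps the similar pair together until the last split receives the lower cost, which is the behavior the objective is designed to reward.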

[1] Alfred V. Aho, et al. Inferring a Tree from Lowest Common Ancestors with an Application to the Optimization of Relational Expressions, 1981, SIAM J. Comput.

[2] Eli V. Olinick, et al. The use of sparsest cuts to reveal the hierarchical community structure of social networks, 2008, Soc. Networks.

[3] Aurko Roy, et al. Hierarchical Clustering via Spreading Metrics, 2016, NIPS.

[4] Varun Kanade, et al. Hierarchical Clustering Beyond the Worst-Case, 2017, NIPS.

[5] Thomas Mailund, et al. Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree, 2013, SODA.

[6] Claire Cardie, et al. Clustering with Instance-Level Constraints, 2000, AAAI/IAAI.

[7] Sanjoy Dasgupta, et al. A cost function for similarity-based hierarchical clustering, 2015, STOC.

[8] Maria-Florina Balcan, et al. Local algorithms for interactive clustering, 2013, ICML.

[9] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[10] Sanjoy Dasgupta, et al. Interactive Bayesian Hierarchical Clustering, 2016, ICML.

[11] M. A. Muñoz, et al. A novel brain partition highlights the modular skeleton shared by structure and function, 2014, Scientific Reports.

[12] Maria-Florina Balcan, et al. Robust hierarchical clustering, 2013, J. Mach. Learn. Res.

[13] Claire Mathieu, et al. Hierarchical Clustering, 2017, SODA.

[14] Maria-Florina Balcan, et al. Clustering with Interactive Feedback, 2008, ALT.

[15] Raymond J. Mooney, et al. Integrating constraints and metric learning in semi-supervised clustering, 2004, ICML.

[16] Ronald L. Rivest, et al. Introduction to Algorithms, third edition, 2009.

[17] Benjamin Moseley, et al. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search, 2017, NIPS.

[18] Ulrike von Luxburg, et al. Kernel functions based on triplet comparisons, 2016, NIPS.

[19] Pavel Berkhin, et al. A Survey of Clustering Data Mining Techniques, 2006, Grouping Multidimensional Data.

[20] Satish Rao, et al. Expander flows, geometric embeddings and graph partitioning, 2004, STOC '04.

[21] Santosh S. Vempala, et al. A discriminative framework for clustering via similarity functions, 2008, STOC.

[22] Adam Tauman Kalai, et al. Adaptively Learning the Crowd Kernel, 2011, ICML.

[23] Matthias Hein, et al. Constrained 1-Spectral Clustering, 2012, AISTATS.

[24] D. Botstein, et al. Cluster analysis and display of genome-wide expression patterns, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[25] David Kempe, et al. Adaptive Hierarchical Clustering Using Ordinal Queries, 2017, SODA.

[26] Moses Charikar, et al. Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics, 2016, SODA.

[27] Fabrizio Lillo, et al. Correlation, Hierarchies, and Networks in Financial Markets, 2008, arXiv:0809.4615.

[28] Leo Keselman, et al. Hierarchical Clustering with Structural Constraints, 2018.

[29] Claire Cardie, et al. Constrained K-means Clustering with Background Knowledge, 2001, Proceedings of the Eighteenth International Conference on Machine Learning, pp. 577–584.