Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space

Hierarchical clustering is typically performed with discrete, algorithmic optimization that searches over the space of trees. While these optimization methods are often effective, their discreteness deprives them of many of the benefits of continuous methods, such as scalable stochastic optimization and the joint optimization of multiple objectives or components of a model (e.g., end-to-end training). In this paper, we present an approach to hierarchical clustering that searches over continuous representations of trees in hyperbolic space using gradient descent. We compactly represent uncertainty over tree structures with vectors in the Poincaré ball. We show how these vectors can be optimized using an objective related to recently proposed cost functions for hierarchical clustering (Dasgupta, 2016; Wang and Wang, 2018). Using our method with a mini-batch stochastic gradient descent inference procedure, we outperform prior work on clustering millions of ImageNet images by 15 points of dendrogram purity. Further, our continuous tree representation can be jointly optimized in multi-task learning applications, offering a 9-point improvement over baseline methods.
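
To make the geometry concrete, the following is a minimal sketch (Python with NumPy, not the authors' implementation) of the two primitives this kind of approach rests on: the Poincaré-ball geodesic distance and a Riemannian SGD update that rescales the Euclidean gradient by the inverse metric and projects back into the ball. The toy loss, node names, step size, and the finite-difference gradient are illustrative assumptions.

import numpy as np

EPS = 1e-5

def poincare_distance(u, v):
    # Geodesic distance between two points strictly inside the unit ball.
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / max(denom, EPS))

def project(x):
    # Retraction: keep the embedding strictly inside the unit ball.
    norm = np.linalg.norm(x)
    return x * (1.0 - EPS) / norm if norm >= 1.0 - EPS else x

def riemannian_sgd_step(x, euclidean_grad, lr=0.05):
    # Rescale the Euclidean gradient by the inverse of the Poincaré-ball
    # metric tensor, take a gradient step, then project back into the ball.
    scale = ((1.0 - np.sum(x ** 2)) ** 2) / 4.0
    return project(x - lr * scale * euclidean_grad)

def numerical_grad(f, x, h=1e-6):
    # Finite-difference gradient; a real system would use autodiff.
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

# Toy usage (illustrative): pull a "child" embedding toward a fixed "parent".
parent = np.array([0.3, 0.1])
child = np.array([-0.4, 0.5])
for _ in range(200):
    loss = lambda c: poincare_distance(c, parent) ** 2
    child = riemannian_sgd_step(child, numerical_grad(loss, child))
print(poincare_distance(child, parent))  # close to 0 after optimization

In the full method, the loss would instead be a continuous relaxation of the Dasgupta-style tree cost computed over mini-batches, with gradients obtained by automatic differentiation rather than finite differences.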

[1] Christopher De Sa, et al. Representation Tradeoffs for Hyperbolic Embeddings, 2018, ICML.

[2] George Karypis, et al. Evaluation of hierarchical clustering algorithms for document datasets, 2002, CIKM '02.

[3] Gao Cong, et al. Hyperbolic Recommender Systems, 2018, ArXiv.

[4] Douwe Kiela, et al. Poincaré Embeddings for Learning Hierarchical Representations, 2017, NIPS.

[5] Moses Charikar, et al. Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics, 2016, SODA.

[6] Christopher D. Manning, et al. Improving Coreference Resolution by Learning Entity-Level Distributed Representations, 2016, ACL.

[7] Gary Bécigneul, et al. Poincaré GloVe: Hyperbolic Word Embeddings, 2018, ICLR.

[8] Andrew McCallum, et al. Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function, 2007.

[9] Tian Zhang, et al. BIRCH: an efficient data clustering method for very large databases, 1996, SIGMOD '96.

[10] Rik Sarkar, et al. Low Distortion Delaunay Embedding of Trees in Hyperbolic Plane, 2011, GD.

[11] Akshay Krishnamurthy, et al. A Hierarchical Algorithm for Extreme Clustering, 2017, KDD.

[12] Jonathan Bingham, et al. Visualizing large hierarchical clusters in hyperbolic space, 2000, Bioinformatics.

[13] David Kempe, et al. Adaptive Hierarchical Clustering Using Ordinal Queries, 2017, SODA.

[14] Dingkang Wang, et al. An Improved Cost Function for Hierarchical Cluster Trees, 2018, J. Comput. Geom.

[15] Aapo Hyvärinen, et al. Validating the independent components of neuroimaging time series via clustering and visualization, 2004, NeuroImage.

[16] M. Spivak. A comprehensive introduction to differential geometry, 1979.

[17] Luke S. Zettlemoyer, et al. End-to-end Neural Coreference Resolution, 2017, EMNLP.

[18] Thomas Hofmann, et al. Hyperbolic Entailment Cones for Learning Hierarchical Embeddings, 2018, ICML.

[19] Silvere Bonnabel, et al. Stochastic Gradient Descent on Riemannian Manifolds, 2011, IEEE Transactions on Automatic Control.

[20] Moses Charikar, et al. Hierarchical Clustering better than Average-Linkage, 2019, SODA.

[21] Andrew McCallum, et al. Linguistically-Informed Self-Attention for Semantic Role Labeling, 2018, EMNLP.

[22] Eric P. Xing, et al. Nonparametric Variational Auto-Encoders for Hierarchical Representation Learning, 2017, ICCV.

[23] Dit-Yan Yeung, et al. A Convex Formulation for Learning Task Relationships in Multi-Task Learning, 2010, UAI.

[24] Benjamin Moseley, et al. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search, 2017, NIPS.

[25] Christian Sohler, et al. BICO: BIRCH Meets Coresets for k-Means Clustering, 2013, ESA.

[26] Marián Boguñá, et al. Sustaining the Internet with Hyperbolic Mapping, 2010, Nature Communications.

[27] Tian Zhang, et al. BIRCH: A New Data Clustering Algorithm and Its Applications, 1997, Data Mining and Knowledge Discovery.

[28] Lars Schmidt-Thieme, et al. BPR: Bayesian Personalized Ranking from Implicit Feedback, 2009, UAI.

[29] Yee Whye Teh, et al. Bayesian Rose Trees, 2010, UAI.

[30] Katherine A. Heller, et al. Bayesian hierarchical clustering, 2005, ICML.

[31] Amin Vahdat, et al. Hyperbolic Geometry of Complex Networks, 2010, Physical Review E.

[32] Dan Roth, et al. Design Challenges and Misconceptions in Named Entity Recognition, 2009, CoNLL.

[33] Suvrit Sra, et al. Fast stochastic optimization on Riemannian manifolds, 2016, ArXiv.

[34] Ramana Rao, et al. Laying out and visualizing large trees using a hyperbolic space, 1994, UIST '94.

[35] Heeyoung Lee, et al. Joint Entity and Event Coreference Resolution across Documents, 2012, EMNLP.

[36] Andreas Krause, et al. Fast and Provably Good Seedings for k-Means, 2016, NIPS.

[37] Zoubin Ghahramani, et al. Pitman-Yor Diffusion Trees, 2011, UAI.

[38] Andrew McCallum, et al. A Discriminative Hierarchical Model for Fast Coreference at Large Scale, 2012, ACL.

[39] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[40] Sanjoy Dasgupta, et al. A cost function for similarity-based hierarchical clustering, 2015, STOC.

[41] D. Sculley, et al. Web-scale k-means clustering, 2010, WWW '10.

[42] Michael I. Jordan, et al. Tree-Structured Stick Breaking for Hierarchical Data, 2010, NIPS.

[43] Santosh S. Vempala, et al. A discriminative framework for clustering via similarity functions, 2008, STOC.

[44] Sivaraman Balakrishnan, et al. Efficient Active Algorithms for Hierarchical Clustering, 2012, ICML.

[45] Grigory Yaroslavtsev, et al. Hierarchical Clustering for Euclidean Data, 2018, AISTATS.

[46] Claire Mathieu, et al. Hierarchical Clustering, 2017, SODA.

[47] Leonidas J. Guibas, et al. Taskonomy: Disentangling Task Transfer Learning, 2018, CVPR.

[48] R. Tibshirani, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[49] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[50] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, Computational Linguistics.

[51] Varun Kanade, et al. Hierarchical Clustering Beyond the Worst-Case, 2017, NIPS.

[52] Robert D. Kleinberg. Geographic Routing Using Hyperbolic Space, 2007, IEEE INFOCOM.

[53] Aurko Roy, et al. Hierarchical Clustering via Spreading Metrics, 2016, NIPS.

[54] John Yen, et al. An incremental approach to building a cluster hierarchy, 2002, ICDM.

[55] Alexander J. Smola, et al. Taxonomy discovery for personalized recommendation, 2014, WSDM.

[56] Anna Choromanska, et al. Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation, 2016, ICML.