Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution, Kingman's coalescent, provides a probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees, in which a node may have more than two children. We generalize an existing belief propagation framework for Kingman's coalescent to the beta coalescent, which models a wider range of tree structures. Because inference requires a complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable. We present results on synthetic and real data showing that the beta coalescent outperforms Kingman's coalescent and is qualitatively better at capturing data in bushy hierarchies.
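To make the contrast between binary and bushy trees concrete, the following is a minimal sketch of the merge rates in a beta coalescent. It assumes the standard Beta(2 - alpha, alpha) parameterization of the Lambda-coalescent with 1 < alpha < 2, under which any particular group of k out of n lineages merges at rate B(k - alpha, n - k + alpha) / B(2 - alpha, alpha). The abstract does not state which parameterization the paper uses, so this choice, along with the function names `merge_rate` and `merge_size_probs`, is an illustrative assumption rather than the authors' implementation.

```python
from math import comb, exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Log of the beta function B(a, b), computed via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def merge_rate(n: int, k: int, alpha: float) -> float:
    """Rate at which one particular group of k out of n lineages merges.

    Assumes the Beta(2 - alpha, alpha) coalescent with 1 < alpha < 2
    (an assumed parameterization, not taken from the abstract).
    """
    if not 1.0 < alpha < 2.0:
        raise ValueError("alpha must lie strictly between 1 and 2")
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    return exp(log_beta(k - alpha, n - k + alpha) - log_beta(2.0 - alpha, alpha))


def merge_size_probs(n: int, alpha: float) -> dict[int, float]:
    """Distribution over the size k of the next merger among n lineages.

    Each size k is weighted by the number of k-subsets times the per-subset rate.
    """
    weights = {k: comb(n, k) * merge_rate(n, k, alpha) for k in range(2, n + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


if __name__ == "__main__":
    # Smaller alpha puts more mass on multi-way mergers (bushier trees);
    # alpha close to 2 concentrates mass on binary mergers, as in Kingman's coalescent.
    for alpha in (1.2, 1.99):
        probs = merge_size_probs(10, alpha)
        summary = ", ".join(f"k={k}: {p:.3f}" for k, p in probs.items() if k <= 4)
        print(f"alpha={alpha}: {summary}")
```

Under this parameterization, letting alpha approach 2 drives the probability of mergers of three or more lineages toward zero, which recovers the binary-only behavior of Kingman's coalescent, while smaller alpha favors the multi-way merges that produce bushy hierarchies.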
