Hierarchical mixture models: a probabilistic analysis

Mixture models form one of the most widely used classes of generative models for describing structured and clustered data. In this paper we develop a new approach to the analysis of hierarchical mixture models. More specifically, using a text clustering problem as motivation, we describe a natural generative process that creates a hierarchical mixture model for the data. In this process, an adversary starts with an arbitrary base distribution and then builds a topic hierarchy via an evolutionary process whose parameters the adversary controls. We prove that, under our assumptions, given a subset of topics that represent generalizations of one another (such as baseball → sports → base) and any document produced by some topic in this hierarchy, we can efficiently determine the most specialized topic in this subset to which the document still belongs. The quality of the classification is independent of the total number of topics in the hierarchy, and our algorithm does not need to know the total number of topics in advance. Our approach also yields an algorithm for clustering and unsupervised topical tree reconstruction. We validate our model by showing that properties predicted by our theoretical results carry over to real data. We then apply our clustering algorithm to two different datasets: (i) "20 newsgroups" [19] and (ii) a snapshot of abstracts of arXiv [2] (15 categories, ~240,000 abstracts). In both cases our algorithm performs extremely well.

[1]  Sampath Kannan, et al. Efficient algorithms for inverting evolution, 1999, JACM.

[2]  Thomas Hofmann, et al. Latent Class Models for Collaborative Filtering, 1999, IJCAI.

[3]  Katherine A. Heller, et al. Bayesian hierarchical clustering, 2005, ICML.

[4]  Sanjoy Dasgupta, et al. A Two-Round Variant of EM for Gaussian Mixtures, 2000, UAI.

[5]  W. Eric L. Grimson, et al. Adaptive background mixture models for real-time tracking, 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[6]  Geoffrey J. McLachlan, et al. Application of Mixture Models to Detect Differentially Expressed Genes, 2005, IDEAL.

[7]  Thomas L. Griffiths, et al. Hierarchical Topic Models and the Nested Chinese Restaurant Process, 2003, NIPS.

[8]  Thomas Hofmann, et al. The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data, 1999, IJCAI.

[9]  Andrew McCallum, et al. A comparison of event models for naive bayes text classification, 1998, AAAI 1998.

[10]  Michael I. Jordan, et al. Hierarchical Dirichlet Processes, 2006.

[11]  Edward H. Adelson, et al. A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models, 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Thomas Hofmann, et al. Text classification in a hierarchical mixture model for small training sets, 2001, CIKM '01.

[13]  Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[14]  Michael J. Black, et al. Mixture models for optical flow computation, 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Nir Ailon, et al. Fitting tree metrics: Hierarchical clustering and phylogeny, 2005, 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).

[16]  Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[17]  Mark Sandler, et al. On the use of linear programming for unsupervised text classification, 2005, KDD '05.

[18]  W. Hoeffding. Probability Inequalities for sums of Bounded Random Variables, 1963.

[19]  L. A. Salter. Algorithms for Phylogenetic Tree Reconstruction, 2007.

[20]  Thomas Hofmann, et al. Unsupervised Learning by Probabilistic Latent Semantic Analysis, 2004, Machine Learning.

[21]  Andris Ambainis, et al. Nearly tight bounds on the learnability of evolution, 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[22]  Jon M. Kleinberg, et al. Using mixture models for collaborative filtering, 2004, STOC '04.

[23]  Jon M. Kleinberg, et al. On learning mixtures of heavy-tailed distributions, 2005, 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).

[24]  J. van Leeuwen, et al. Intelligent Data Engineering and Automated Learning, 2003, Lecture Notes in Computer Science.

[25]  Geoffrey J. McLachlan, et al. Mixture models: inference and applications to clustering, 1989.