Unsupervised Text Learning Based on Context Mixture Model with Dirichlet Prior

In this paper, we propose a Bayesian mixture model that introduces a context variable with a Dirichlet prior to model the multiple topics of a text and then cluster documents. It is a novel unsupervised text learning algorithm for clustering large-scale web data. For parameter estimation, we adopt maximum likelihood (ML) with the EM algorithm, and we employ the Bayesian information criterion (BIC) to determine the number of clusters. Experimental results show that the proposed method distinctly outperforms the baseline algorithms.
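The general recipe named in the abstract (fit a mixture by EM, then pick the number of clusters by BIC) can be sketched as follows. This is a minimal illustration and not the paper's context mixture model: it assumes a plain mixture of multinomials over a document-term count matrix, uses additive smoothing with a hypothetical `alpha` as a stand-in for a symmetric Dirichlet prior, and all function names (`fit_multinomial_mixture`, `bic`, `select_k`) are invented for this example.

```python
import numpy as np


def fit_multinomial_mixture(X, k, n_iter=50, alpha=1.0, seed=0):
    """EM for a k-component mixture of multinomials.

    X is an (n_docs, vocab_size) matrix of word counts. alpha is an assumed
    additive-smoothing constant, not a value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[0]
    rng = np.random.default_rng(seed)
    # Start from random soft assignments of documents to clusters.
    resp = rng.random((n_docs, k)) + 0.1
    resp /= resp.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: smoothed mixing weights and per-cluster word distributions.
        pi = (resp.sum(axis=0) + alpha) / (n_docs + k * alpha)
        counts = resp.T @ X + alpha
        theta = counts / counts.sum(axis=1, keepdims=True)
        # E-step: posterior cluster probabilities, computed in log space.
        log_joint = np.log(pi) + X @ np.log(theta).T
        top = log_joint.max(axis=1, keepdims=True)
        log_norm = top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))
        resp = np.exp(log_joint - log_norm[:, None])
    # Log-likelihood up to the multinomial coefficient, which does not depend on k.
    return pi, theta, resp, float(log_norm.sum())


def bic(log_lik, k, n_docs, vocab_size):
    """BIC = -2 log L + (number of free parameters) * log(number of documents)."""
    n_params = (k - 1) + k * (vocab_size - 1)
    return -2.0 * log_lik + n_params * np.log(n_docs)


def select_k(X, candidates):
    """Fit one mixture per candidate k and return the k with the lowest BIC."""
    X = np.asarray(X, dtype=float)
    n_docs, vocab_size = X.shape
    scores = {}
    for k in candidates:
        log_lik = fit_multinomial_mixture(X, k)[3]
        scores[k] = bic(log_lik, k, n_docs, vocab_size)
    return min(scores, key=scores.get), scores
```

Calling `select_k(X, [1, 2, 3])` on a count matrix `X` returns the candidate with the lowest BIC together with the score of each candidate. Because BIC penalizes every additional cluster by its extra free parameters, a larger `k` is chosen only when it improves the likelihood enough to pay for them.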
