Hierarchical Dirichlet model for document classification

The proliferation of text documents on the web and within institutions calls for convenient organization that enables efficient information retrieval. Although text corpora are frequently organized into concept hierarchies or taxonomies, classifying documents into the hierarchy is expensive in terms of human effort. We present a novel and simple hierarchical Dirichlet generative model for text corpora and derive an efficient algorithm for estimating the model parameters and for the unsupervised classification of text documents into a given hierarchy. The hierarchical Bayesian structure of the model makes the class-conditional feature means inter-related. We show that the algorithm yields robust estimates of the classification parameters by performing smoothing, or regularization. We present experimental evidence on real web data that our algorithm achieves significant gains in accuracy over simpler models.
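The smoothing effect described above can be illustrated with a minimal sketch. It is not the estimation algorithm of the paper; it only assumes that a child class's word distribution has a Dirichlet prior centered on its parent's mean, so that the posterior mean shrinks sparse child counts toward the parent. The function name `smoothed_class_means`, the `concentration` parameter, and the toy numbers are all hypothetical.

```python
def smoothed_class_means(child_counts, parent_mean, concentration):
    """Posterior mean of a child's word distribution.

    Assumes (for illustration only) a Dirichlet prior on the child's
    distribution with mean `parent_mean` and total pseudo-count
    `concentration`, combined with observed multinomial word counts.
    """
    total = sum(child_counts)
    return [
        (count + concentration * mean) / (total + concentration)
        for count, mean in zip(child_counts, parent_mean)
    ]


# Toy three-word vocabulary: the parent class has a well-estimated mean,
# while the child class has only two observed word occurrences.
parent_mean = [0.5, 0.3, 0.2]
sparse_child_counts = [2, 0, 0]

estimate = smoothed_class_means(sparse_child_counts, parent_mean, concentration=2.0)
print(estimate)
```

Without the prior, the child's maximum-likelihood estimate would assign zero probability to the two unseen words. With it, every word keeps a nonzero probability, and a larger `concentration` pulls the estimate closer to the parent.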
