Boosting multi-label hierarchical text categorization

Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Although most large classification schemes for text have a hierarchical structure, text classification researchers have so far focused mostly on algorithms for “flat” classification, i.e. algorithms that operate on non-hierarchical classification schemes. When applied to a hierarchical classification problem, these algorithms cannot take advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm that is a hierarchical variant of AdaBoost.MH, a very well-known member of the family of “boosting” learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC, e.g. the intuitions that both feature selection and the selection of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in a simple and elegant way: TreeBoost.MH is defined as a recursive algorithm that uses AdaBoost.MH as its base step and that recurs over the tree structure. We present the results of experiments with TreeBoost.MH on three HTC benchmarks, and analytically discuss its computational cost.
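The recursive structure described above can be illustrated with a minimal Python sketch. It is not the TreeBoost.MH implementation: the names `Node`, `train_flat`, `tree_train` and `classify` are hypothetical, the toy documents are invented for illustration, and `train_flat` is a deliberately trivial stand-in (one indicative term per category) for the flat multi-label learner, which in the paper is AdaBoost.MH. The sketch only shows how recursing over the tree makes the training set, the negative examples and the vocabulary "local" to each internal node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Doc = Tuple[Set[str], Set[str]]  # (terms in the document, categories it belongs to)

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

def subtree(node: Node) -> Set[str]:
    """Names of all categories in the subtree rooted at node."""
    names = {node.name}
    for child in node.children:
        names |= subtree(child)
    return names

def train_flat(docs: List[Doc], targets: List[Set[str]],
               labels: List[str]) -> Dict[str, str]:
    """Stand-in for the flat learner (assumption, NOT AdaBoost.MH):
    for each label, pick the term that best separates positives from negatives.
    A real boosting learner would keep its weight distribution here,
    over the local training set only."""
    vocab = sorted(set().union(*(terms for terms, _ in docs)))  # local vocabulary
    model: Dict[str, str] = {}
    for label in labels:
        def score(term: str) -> int:
            return sum(1 if label in t else -1
                       for (terms, _), t in zip(docs, targets) if term in terms)
        if vocab:
            model[label] = max(vocab, key=score)
    return model

def tree_train(node: Node, docs: List[Doc]) -> Dict[str, Dict[str, str]]:
    """Train one flat multi-label classifier per internal node, recursively."""
    if not node.children:
        return {}
    scopes = {c.name: subtree(c) for c in node.children}
    # A document is a positive example of a child if it belongs to the child's subtree.
    targets = [{n for n, s in scopes.items() if cats & s} for _, cats in docs]
    models = {node.name: train_flat(docs, targets, list(scopes))}
    for child in node.children:
        # Only documents under this child are passed down, so deeper classifiers
        # see negatives drawn from sibling categories only.
        local_docs = [d for d, t in zip(docs, targets) if child.name in t]
        models.update(tree_train(child, local_docs))
    return models

def classify(node: Node, models: Dict[str, Dict[str, str]],
             terms: Set[str]) -> Set[str]:
    """Top-down classification: descend only into accepted children."""
    assigned: Set[str] = set()
    for child in node.children:
        if models.get(node.name, {}).get(child.name) in terms:
            assigned.add(child.name)
            assigned |= classify(child, models, terms)
    return assigned

tree = Node("root", [Node("sport", [Node("football"), Node("tennis")]),
                     Node("politics")])
train_docs: List[Doc] = [
    ({"goal", "match"}, {"football"}),
    ({"racket", "match"}, {"tennis"}),
    ({"vote", "election"}, {"politics"}),
    ({"goal", "match", "league"}, {"football"}),
    ({"vote", "law"}, {"politics"}),
]
models = tree_train(tree, train_docs)
print(classify(tree, models, {"match", "racket"}))
```

In this toy run the classifier at the `sport` node is trained only on the three sport documents, so the politics documents play no role in choosing between `football` and `tennis`. A document that is rejected at the root level is never examined by the deeper classifiers.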
