Classifying web documents in a hierarchy of categories: a comprehensive study

Most research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). In many information repositories, however, documents are organized in a hierarchy of categories to support thematic search by browsing topics of interest. Taking the hierarchical relationships among categories into account raises several additional issues in the development of methods for automated document classification. These issues concern the representation of documents, the learning process, the classification process and the criteria for evaluating experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document in a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1), which present a variety of situations in terms of hierarchical structure. Experimental results are reported, and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
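To make the idea of threshold-gated hierarchical classification concrete, the following is a minimal sketch of a generic top-down scheme. It is not the WebClassIII algorithm: the abstract does not specify how documents are routed or how thresholds are determined. The `Category` class, the `keyword_scorer` helper, the toy hierarchy and the fixed threshold of 0.3 are all assumptions introduced for illustration; in the framework described above, thresholds are determined automatically and the scorer would be a centroid-based, naïve Bayes or SVM classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Document = Dict[str, float]  # term -> weight (hypothetical representation)


@dataclass
class Category:
    """A node in the category hierarchy (illustrative, not from the paper)."""
    name: str
    scorer: Callable[[Document], float]  # degree of membership in [0, 1]
    threshold: float                     # minimum score needed to enter this node
    children: List["Category"] = field(default_factory=list)


def keyword_scorer(terms):
    """Toy scorer: fraction of the document's weight carried by `terms`."""
    def score(doc: Document) -> float:
        total = sum(doc.values())
        if total == 0:
            return 0.0
        return sum(w for t, w in doc.items() if t in terms) / total
    return score


def classify_top_down(doc: Document, root: Category) -> str:
    """Descend from the root while some child's score reaches its threshold.

    Returns the name of the deepest category that accepts the document;
    a document rejected by every child stays at the current node.
    """
    current = root
    while True:
        scored = [(child.scorer(doc), child) for child in current.children]
        accepted = [(s, c) for s, c in scored if s >= c.threshold]
        if not accepted:
            return current.name
        # Follow the best-scoring child among those that passed their threshold.
        current = max(accepted, key=lambda pair: pair[0])[1]


# Toy hierarchy: root -> {science -> {physics, biology}, sports}
physics = Category("physics", keyword_scorer({"quantum", "energy"}), 0.3)
biology = Category("biology", keyword_scorer({"cell", "gene"}), 0.3)
science = Category(
    "science",
    keyword_scorer({"quantum", "energy", "cell", "gene"}),
    0.3,
    [physics, biology],
)
sports = Category("sports", keyword_scorer({"match", "goal"}), 0.3)
root = Category("root", lambda doc: 1.0, 0.0, [science, sports])

doc_a = {"quantum": 2.0, "energy": 1.0, "match": 1.0}
doc_b = {"match": 1.0, "weather": 3.0}

print(classify_top_down(doc_a, root))  # physics
print(classify_top_down(doc_b, root))  # root
```

In this sketch a document that no child accepts remains at an internal node, which is one way a hierarchical classifier can differ from a flat one that must always pick a leaf category.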
