Hierarchical Classification of HTML Documents with WebClassII

This paper describes a new method for classifying an HTML document into a hierarchy of categories. The hierarchy is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated determination of thresholds for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for evaluating system performance is introduced in order to compare three techniques: flat, hierarchical with proper training sets, and hierarchical with hierarchical training sets. The method has been implemented in a client-server application named WebClassII. Results show that hierarchical techniques perform better when trained on hierarchical training sets.
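
The idea of assigning a document to any node of the hierarchy, using per-category score thresholds, can be illustrated with a minimal sketch. It is not the WebClassII implementation: the greedy top-down descent, the `Category` class, the `keyword_score` function, the `KEYWORDS` table, and the hand-set thresholds are all assumptions made for illustration (the paper determines thresholds automatically and uses learned classifiers).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Category:
    """A node in a hypothetical category hierarchy."""
    name: str
    threshold: float = 0.5  # hand-set here; assumed, not learned
    children: List["Category"] = field(default_factory=list)

# Hypothetical keyword lists standing in for learned per-category models.
KEYWORDS = {
    "Science": {"experiment", "theory"},
    "Physics": {"quantum", "particle"},
    "Sports": {"match", "team"},
}

def keyword_score(cat: Category, doc: Set[str]) -> float:
    """Toy score: fraction of the category's keywords found in the document."""
    words = KEYWORDS.get(cat.name, set())
    if not words:
        return 0.0
    return sum(1 for w in words if w in doc) / len(words)

def classify(root: Category, doc: Set[str],
             score: Callable[[Category, Set[str]], float] = keyword_score) -> str:
    """Descend from the root; stop at the current node when no child
    scores at least its own threshold."""
    node = root
    while node.children:
        best = max(node.children, key=lambda c: score(c, doc))
        if score(best, doc) < best.threshold:
            break  # the document stays at an internal node
        node = best
    return node.name

root = Category("Top", children=[
    Category("Science", children=[Category("Physics")]),
    Category("Sports"),
])

print(classify(root, {"experiment", "theory", "quantum", "particle"}))
print(classify(root, {"experiment", "theory"}))
print(classify(root, {"weather"}))
```

In this sketch a document that matches "Science" but not "Physics" stops at the internal node "Science", and a document that matches nothing stays at the root, which mirrors the setting in which documents may belong to any node rather than only to leaves.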
