Strategies for minimising errors in hierarchical web categorisation

On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo!~and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierachies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher level categories when lower level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.

[1]  Daphne Koller,et al.  Hierarchically Classifying Documents Using Very Few Words , 1997, ICML.

[2]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[3]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[4]  Aaron Kershenbaum,et al.  The Effect of Using Hierarchical Classifiers in Text Categorization , 2000, RIAO.

[5]  Susan T. Dumais,et al.  Hierarchical classification of Web content , 2000, SIGIR '00.

[6]  Aaron Kershenbaum,et al.  Category Levels in Hierarchical Text Categorization , 1998, EMNLP.

[7]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[8]  Padmini Srinivasan,et al.  Hierarchical neural networks for text categorization , 1999, SIGIR 1999.

[9]  Andreas S. Weigend,et al.  Exploiting Hierarchy in Text Categorization , 1999, Information Retrieval.

[10]  Gerard Salton,et al.  Research and Development in Information Retrieval , 1982, Lecture Notes in Computer Science.

[11]  Gerald Salton,et al.  Automatic text processing , 1988 .

[12]  Thorsten Joachims,et al.  A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization , 1997, ICML.

[13]  Hugh E. Williams,et al.  Fast Categorisation of Large Document Collections , 2001, SPIRE.

[14]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[15]  Padmini Srinivasan,et al.  Hierarchical neural networks for text categorization (poster abstract) , 1999, SIGIR '99.

[16]  W. Bruce Croft,et al.  Combining classifiers in text categorization , 1996, SIGIR '96.

[17]  Hugh E. Williams,et al.  Searchable words on the Web , 2005, International Journal on Digital Libraries.

[18]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[19]  hierarchyDunja Mladeni Feature Selection for Classiication Based on Text Hierarchy , 1998 .

[20]  Philip J. Hayes,et al.  CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories , 1990, IAAI.