Clustering documents into a web directory for bootstrapping a supervised classification

The management of hierarchically organized data is starting to play a key role in the knowledge management community due to the proliferation of topic hierarchies for text documents. The creation and maintenance of such organized repositories of information requires a great deal of human intervention. The machine learning community has partially addressed this problem by developing hierarchical supervised classifiers that help people categorize new resources within given hierarchies. The main drawback of hierarchical supervised classifiers, however, is their high demand for labeled examples: the number of examples required grows with the number of topics in the taxonomy. Bootstrapping a huge hierarchy with a proper set of labeled examples is therefore a critical issue. This paper proposes some solutions to the bootstrapping problem that implicitly or explicitly use the taxonomy definition: a baseline approach that classifies documents according to the class terms, and two clustering approaches whose training is constrained by the a priori knowledge encoded in the taxonomy structure, which consists of both terminological and relational aspects. In particular, we propose the Tax-SOM model, which clusters a set of documents into a predefined hierarchy of classes, directly exploiting the knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google™ and LookSmart™ web directories, with good results.
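The idea behind the baseline, classifying documents according to the class terms, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy taxonomy, the names `TAXONOMY`, `node_terms`, and `classify`, the whitespace tokenization, and the choice to let a node inherit the label terms of its ancestors are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy taxonomy: node name -> (parent node, label terms).
TAXONOMY = {
    "arts": (None, {"arts"}),
    "arts/music": ("arts", {"music", "jazz"}),
    "arts/painting": ("arts", {"painting", "canvas"}),
}


def node_terms(node, taxonomy):
    """Collect the label terms of a node together with those of its ancestors.

    Inheriting ancestor terms is an assumption of this sketch, used as a
    simple way to exploit the relational structure of the taxonomy.
    """
    terms = set()
    while node is not None:
        parent, own_terms = taxonomy[node]
        terms |= own_terms
        node = parent
    return terms


def classify(document, taxonomy):
    """Return the node whose terms occur most often in the document.

    Returns None when no class term occurs at all. Ties are resolved in
    favor of the node listed first in the taxonomy dictionary.
    """
    counts = Counter(document.lower().split())
    scores = {
        node: sum(counts[term] for term in node_terms(node, taxonomy))
        for node in taxonomy
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(classify("A jazz concert with live music", TAXONOMY))
    print(classify("Nothing relevant here", TAXONOMY))
```

Documents labeled this way could then serve as an initial training set for a supervised classifier, which is the bootstrapping role the abstract describes; a realistic version would need stemming, stop-word removal, and a weighting scheme instead of raw term counts.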
