Restrictive clustering and metaclustering for self-organizing document collections

This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These techniques yield higher cluster purity and better overall accuracy, and they make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.
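The idea of leaving out low-confidence documents can be illustrated with a minimal sketch. It is not the paper's algorithm: the cosine-similarity margin rule, the function name `restrictive_assign`, and the `min_margin` parameter are all assumptions made for illustration. A document is assigned to its most similar centroid only if that centroid beats the runner-up by at least `min_margin`; otherwise the document is left unassigned (`None`).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def restrictive_assign(docs, centroids, min_margin=0.2):
    """Hypothetical restrictive assignment step (illustrative assumption).

    Returns one label per document: the index of the closest centroid,
    or None when the gap between the best and second-best similarity
    is below min_margin, i.e. the document is left out.
    """
    labels = []
    for doc in docs:
        sims = sorted(((cosine(doc, c), i) for i, c in enumerate(centroids)),
                      reverse=True)
        best_sim, best_idx = sims[0]
        runner_up = sims[1][0] if len(sims) > 1 else 0.0
        labels.append(best_idx if best_sim - runner_up >= min_margin else None)
    return labels

# Toy term vectors: two clear-cut documents and one equally close to both centroids.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = restrictive_assign(docs, centroids)
print(labels)
```

In this toy example the third document is equally similar to both centroids, so it is left out instead of being forced into a cluster. Raising `min_margin` trades coverage (fewer documents assigned) for purity of the clusters that remain.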
