Incremental hierarchical clustering of text documents

Incremental hierarchical text document clustering algorithms are important for organizing documents generated from streaming online sources such as newswire and blogs. However, this is a relatively unexplored area in the text document clustering literature. Popular incremental hierarchical clustering algorithms, namely Cobweb and Classit, have not been widely used with text document data. We discuss why, in their current form, these algorithms are not suitable for text clustering and propose an alternative formulation that changes the algorithm's underlying distributional assumption to conform with the data. Both the original Classit algorithm and our proposed algorithm are evaluated using Reuters newswire articles and the Ohsumed dataset.
