Evaluating and Comparing Text Clustering Results

Text clustering is a useful and inexpensive way to organize vast text repositories into meaningful topical categories. However, there is little consensus on which clustering techniques work best and under what circumstances, because researchers do not use the same evaluation methodologies and document collections. Furthermore, text clustering offers a low-cost alternative to supervised classification, which relies on labeled training data that is expensive and difficult to handcraft. Yet there is no means of comparing the two approaches and deciding which one is better suited to a particular situation. In this paper, we propose and experiment with a framework that allows one to effectively compare text clustering results with one another and with supervised text categorization.
