Document Clustering Evaluation: Divergence from a Random Baseline

Divergence from a random baseline is a technique for evaluating document clustering. It ensures that a cluster quality measure is performing useful work, so that ineffective clusterings, which provide no useful result, cannot achieve high scores. These concepts are defined and analysed using both intrinsic and extrinsic approaches to the evaluation of document cluster quality. The extrinsic approaches include the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach differentiates the ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to that of the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion, as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. Although the approach can be applied to any clustering evaluation, this paper describes its use in the context of document clustering evaluation.
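The idea can be illustrated with a minimal sketch. The abstract does not specify how the random baseline is built, so this sketch assumes it is obtained by shuffling the cluster assignments, which keeps the cluster sizes of the evaluated clustering but assigns documents at random. Purity stands in for the quality measure, and the names `purity` and `divergence_from_random` and the parameters `trials` and `seed` are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def purity(clusters, categories):
    """Fraction of documents that fall in the majority category of their cluster."""
    by_cluster = {}
    for cluster, category in zip(clusters, categories):
        by_cluster.setdefault(cluster, Counter())[category] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(clusters)


def divergence_from_random(clusters, categories, measure=purity, trials=100, seed=0):
    """Score of a clustering minus the mean score of shuffled copies of it.

    Assumption: the random baseline keeps the same cluster sizes and only
    permutes which document lands in which cluster.
    """
    rng = random.Random(seed)
    shuffled = list(clusters)
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)
        baseline += measure(shuffled, categories)
    baseline /= trials
    return measure(clusters, categories) - baseline


if __name__ == "__main__":
    categories = ["a"] * 4 + ["b"] * 4
    informative = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the categories
    singletons = list(range(8))             # one document per cluster

    # Raw purity rates both clusterings as perfect.
    print(purity(informative, categories), purity(singletons, categories))
    # After subtracting the baseline, only the informative clustering stays above zero.
    print(divergence_from_random(informative, categories))
    print(divergence_from_random(singletons, categories))
```

In this toy case the one-document-per-cluster solution reaches perfect purity, yet its shuffled copies score exactly the same, so its divergence is zero. Any other measure with the same call signature could be passed as `measure`; for a measure where lower is better, such as RMSE, the sign of the difference would need to be reversed.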
