Effective measures for inter-document similarity

While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine similarity across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine similarity remains the prevalent method for determining inter-document similarity in clustering and other applications. However, recent research demonstrates that BM25 term weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment, we analyze the accuracy of nearest neighborhoods obtained with our measures. In our second experiment, we analyze the use of clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome argues strongly for adopting these measures in place of cosine similarity in future work.
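To make the contrast between raw cosine similarity and BM25 term weighting concrete, the sketch below computes BM25 term weights for tokenized documents and compares two documents by the cosine of the weighted vectors. It is an illustration only and is not the symmetric BM25 measure proposed in this work, which the abstract does not define. The function names `bm25_weights` and `cosine`, the parameter values `k1=1.2` and `b=0.75`, and the always-positive IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document.

    Assumptions (not taken from the paper): k1 and b use commonly
    quoted defaults, and the IDF is the variant log(1 + (N - df + 0.5) /
    (df + 0.5)), which stays positive for every term.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        vectors.append({
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
               * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical toy corpus, used only to exercise the functions.
docs = [
    ["bm25", "ranking", "search"],
    ["bm25", "ranking", "clustering"],
    ["cats", "dogs"],
]
vecs = bm25_weights(docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing terms: positive score
print(cosine(vecs[0], vecs[2]))  # documents sharing no terms: 0.0
```

Because the score here is a cosine over weighted vectors, it is symmetric by construction; a measure that scores one document as a "query" against another would need an explicit symmetrization step, which this sketch does not attempt.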
