Copulas for information retrieval

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which the constituent scores were estimated, or they use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce copulas, a powerful statistical framework for modeling complex multidimensional dependencies, to information retrieval tasks. We provide a formal background on copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusing results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that, compared to their non-copula counterparts, they yield significant performance improvements on several tasks when evaluated on large-scale standard corpora. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.
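To make the idea of combining multidimensional relevance estimates concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes two hypothetical per-document scores (`topical` and `quality`), maps each to the unit interval with a rank-based empirical CDF, and joins them with a bivariate Gumbel copula. The copula family, the value of `theta`, the function names, and the toy scores are all illustrative assumptions.

```python
import math


def empirical_cdf(scores):
    """Map each score to a rank-based value strictly inside (0, 1)."""
    n = len(scores)
    # Count of observations <= s, divided by n + 1 so the result never hits 0 or 1.
    return [sum(1 for t in scores if t <= s) / (n + 1) for s in scores]


def gumbel_copula(u, v, theta=2.0):
    """Bivariate Gumbel copula C(u, v).

    theta >= 1 controls dependence strength; theta = 1 reduces to the
    independence copula u * v. The default here is an arbitrary choice.
    """
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))


# Hypothetical scores for four documents along two relevance dimensions.
topical = [2.1, 0.4, 1.7, 3.0]
quality = [0.1, 0.3, 0.5, 0.9]

u = empirical_cdf(topical)
v = empirical_cdf(quality)

# One combined score per document; higher means jointly stronger on both criteria.
combined = [gumbel_copula(ui, vi) for ui, vi in zip(u, v)]

# Document indices ordered from best to worst combined score.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

Because the copula operates on the transformed marginals rather than on raw scores, the two dimensions can live on different scales without manual normalization. In practice the dependence parameter would be fitted to data rather than fixed by hand.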
