Improving Retrievability and Recall by Automatic Corpus Partitioning

With increasing volumes of data, much effort has been devoted to finding the most suitable answer to an information need. In many domains, however, the essential question is whether a specific information item can be found at all via a reasonable set of queries. This concept of retrievability has evolved into an important evaluation measure for IR systems in recall-oriented application domains. While several studies have evaluated retrieval bias in such systems, a solid validation of the impact of retrieval bias, as well as methods to counter the low retrievability of certain document types, is still lacking. This paper provides an in-depth study of retrievability characteristics over queries of different lengths in a large benchmark corpus, validating previous studies. It analyzes the possibility of automatically categorizing documents into low- and high-retrievable classes based on document properties rather than a complex retrievability analysis. We furthermore show that this classification can be used to improve the overall retrievability of documents by treating the classes as separate document corpora and combining the individual retrieval results. The experiments are validated on the 1.2 million patents of the TREC Chemical Retrieval Track.
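
To make the notion of retrievability concrete, the sketch below computes the cumulative retrievability score r(d) commonly attributed to Azzopardi and Vinay: r(d) counts how many queries from a (large) query set retrieve document d within a rank cutoff c. This is an illustrative sketch only; the query set, the search function, and the cutoff value are assumptions made for the example and are not taken from the paper itself.

```python
# Minimal sketch of the cumulative retrievability measure:
#   r(d) = sum over queries q of f(k_dq, c),
# where k_dq is the rank of document d in the result list for q and
# f(k_dq, c) = 1 if k_dq <= c, else 0.
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def retrievability_scores(
    queries: Iterable[str],
    search: Callable[[str], List[str]],  # assumed: returns ranked doc ids for a query
    cutoff: int = 100,                   # rank cutoff c (assumed value)
) -> Dict[str, int]:
    """Count, per document, how many queries retrieve it within the cutoff."""
    r: Dict[str, int] = defaultdict(int)
    for q in queries:
        for rank, doc_id in enumerate(search(q), start=1):
            if rank > cutoff:
                break
            r[doc_id] += 1  # f(k_dq, c) = 1 when k_dq <= c
    return dict(r)
```

In this view, documents whose score stays near zero across a large query set form the low-retrievable class that the paper proposes to identify from document properties instead of running such a full retrievability analysis.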
