Efficient Extended Boolean Retrieval

Extended Boolean retrieval (EBR) models were proposed nearly three decades ago, but have had little practical impact, despite their significant advantages over both ranked keyword retrieval and pure Boolean retrieval. In particular, EBR models produce meaningful rankings; their query model allows complex concepts to be represented in an and-or format; and they are scrutable, in that the score assigned to a document depends solely on the content of that document, unaffected by collection statistics or other external factors. These characteristics make EBR models attractive in domains such as medical and legal searching, where the emphasis is on the iterative development of reproducible complex queries of dozens or even hundreds of terms. However, EBR is much more computationally expensive than the alternatives. We consider the implementation of the p-norm approach to EBR, and demonstrate that ideas used in the max-score and WAND exact optimization techniques for ranked keyword retrieval can be adapted to allow documents to be selectively bypassed via a low-cost screening process, for this and similar retrieval models. We also propose term-independent bounds that further reduce the number of score calculations for short, simple queries under the extended Boolean retrieval model. Together, these methods save 50 to 80 percent of the evaluation cost on test queries drawn from biomedical search.
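As an illustration of the p-norm scoring mentioned above, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes unweighted query terms, binary per-document term scores (1.0 if the term is present, 0.0 otherwise), and a hypothetical tuple encoding of the query tree; the name `pnorm_score` and the example terms are illustrative only.

```python
from typing import Dict, Union

# Hypothetical query encoding (an assumption for this sketch):
# a node is either a term (str) or a tuple (operator, p, children),
# where operator is "AND" or "OR" and p >= 1 is the norm parameter.
Query = Union[str, tuple]


def pnorm_score(query: Query, term_scores: Dict[str, float]) -> float:
    """Score one document against a p-norm query tree.

    term_scores maps each term to its score in [0, 1] for this document;
    missing terms score 0.0. Only the document's own content is used.
    """
    if isinstance(query, str):
        return term_scores.get(query, 0.0)

    op, p, children = query
    xs = [pnorm_score(child, term_scores) for child in children]
    n = len(xs)

    if op == "OR":
        # Normalized p-norm distance from the all-zero point.
        return (sum(x ** p for x in xs) / n) ** (1.0 / p)
    if op == "AND":
        # One minus the normalized p-norm distance from the all-one point.
        return 1.0 - (sum((1.0 - x) ** p for x in xs) / n) ** (1.0 / p)
    raise ValueError(f"unknown operator: {op!r}")


# Example: (heart OR cardiac) AND surgery, with p = 2 at both operators.
query = ("AND", 2.0, [("OR", 2.0, ["heart", "cardiac"]), "surgery"])
doc = {"heart": 1.0, "surgery": 1.0}  # "cardiac" is absent
score = pnorm_score(query, doc)
```

Under these assumptions each operator is non-decreasing in its children's scores, so substituting 1.0 for every term a document might still contain gives an upper bound on its final score. That monotonicity is the property a max-score-style screening step would rely on to skip documents that cannot enter the current top results.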
