Doc2Query--: When Less is More

Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration at https://github.com/terrierteam/pyterrier_doc2query.
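The filtering idea can be illustrated with a minimal sketch. It is not the released implementation: the function names (`overlap_score`, `filter_queries`, `expand_document`), the threshold value, and the toy term-overlap scorer are all assumptions made for illustration. In practice the scorer would be a trained relevance model, and the threshold would be tuned rather than fixed.

```python
from typing import Callable, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a relevance model (an assumption, not the paper's model).

    Returns the fraction of query terms that also occur in the document.
    """
    query_terms = query.lower().split()
    document_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    matched = sum(term in document_terms for term in query_terms)
    return matched / len(query_terms)


def filter_queries(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep only the generated queries that score at or above the threshold."""
    return [q for q in queries if score_fn(q, document) >= threshold]


def expand_document(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen only for this example
) -> str:
    """Append the surviving queries to the document text before indexing."""
    kept = filter_queries(document, queries, score_fn, threshold)
    return " ".join([document] + kept)


if __name__ == "__main__":
    doc = "the eiffel tower is in paris"
    generated = [
        "where is the eiffel tower",           # supported by the document
        "how tall is the statue of liberty",   # hallucinated content
    ]
    print(expand_document(doc, generated))
```

Because discarded queries never reach the index, the expanded text is shorter than with unfiltered expansion, which is the mechanism behind the index-size and query-time reductions reported above.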
