Learning Passage Impacts for Inverted Indexes

Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and then uses a contextualized language model to directly estimate the semantic importance of each token in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches, improving effectiveness metrics by up to 17% relative to DocT5Query. When deployed in a re-ranking scenario, it reaches the same effectiveness as state-of-the-art approaches while running up to 5.1x faster.
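To illustrate why a single value per token per document fits a standard inverted index, the following is a minimal sketch rather than the authors' implementation. It assumes that a document's score for a query is the sum of the stored impacts of the query terms it contains. The impact values, document identifiers, and the function names `build_index` and `search` are hypothetical; in DeepImpact the impacts would come from a contextualized language model applied to the DocT5Query-expanded documents.

```python
from collections import defaultdict


def build_index(doc_impacts):
    """Build an inverted index: term -> list of (doc_id, impact) postings."""
    index = defaultdict(list)
    for doc_id, impacts in doc_impacts.items():
        for term, impact in impacts.items():
            index[term].append((doc_id, impact))
    return index


def search(index, query_terms, k=10):
    """Score documents by summing the impacts of matching query terms.

    Summation is an assumption made for this sketch.
    """
    scores = defaultdict(float)
    for term in set(query_terms):  # count each distinct query term once
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += impact
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical per-document term impacts (made-up numbers, not model output).
doc_impacts = {
    "d1": {"inverted": 3.0, "index": 2.5, "retrieval": 1.0},
    "d2": {"neural": 2.0, "retrieval": 3.0, "ranking": 1.5},
    "d3": {"index": 1.0, "compression": 4.0},
}

index = build_index(doc_impacts)
results = search(index, ["index", "retrieval"])
print(results)
```

Because every impact is computed at indexing time, query processing in this sketch needs only postings-list traversal and addition, with no neural inference per query.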
