A Review of Research in First-Stage Retrieval

This paper studies first-stage retrieval from four aspects: its development background, frontier techniques, current challenges, and future directions. Our contribution has two main parts. First, we review retrieval techniques proposed by researchers and draw targeted conclusions through comparative analysis. Second, we discuss different research directions and compare how combining different techniques affects first-stage retrieval. In this way, the survey provides a comprehensive overview of the field, which we hope will be useful to researchers and practitioners working on first-stage retrieval and will inspire new ideas and further developments.
