Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model encoders used by DRs are trained and fine-tuned on clean, well-curated text; misspelled queries rarely appear in this data, and thus misspelled queries observed at inference time are out-of-distribution with respect to both pre-training and fine-tuning. Previous efforts to address this issue have focused on \textit{fine-tuning} strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ a separate, state-of-the-art spell-checking component. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel \textit{pre-training} strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and passes a bottlenecked representation to the decoder. The decoder takes as input this bottlenecked embedding together with the token embeddings of the original text, in which the misspelled tokens are masked out. The pre-training task is to recover the masked tokens on both the encoder and decoder sides. Our extensive experiments and detailed ablation studies show that DRs pre-trained with ToRoDer are significantly more effective on misspelled queries, substantially closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
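
To make the bottlenecked pre-training objective concrete, below is a minimal PyTorch sketch of the setup described above: a deep encoder reads the misspelled, partially masked text and emits a single bottleneck vector, a shallow decoder reads that vector together with embeddings of the clean text whose misspelled positions are masked, and both sides are trained with a masked-token recovery loss. All names, layer counts, the [CLS]-vector bottleneck choice, and the label convention are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of ToRoDer-style bottlenecked pre-training (assumed configuration).
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 30522, 768, 128  # BERT-like defaults (assumed)

class ToRoDerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, HIDDEN)
        self.pos_emb = nn.Embedding(MAX_LEN, HIDDEN)
        # Deep encoder: reads the *misspelled* text with some tokens masked.
        enc_layer = nn.TransformerEncoderLayer(HIDDEN, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        # Shallow decoder: sees only the bottleneck vector plus embeddings of the
        # *clean* text with the misspelled positions masked out.
        dec_layer = nn.TransformerEncoderLayer(HIDDEN, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, noisy_masked_ids, clean_masked_ids, enc_labels, dec_labels):
        # Encoder side: recover masked tokens of the misspelled input (MLM loss).
        enc_hidden = self.encoder(self.embed(noisy_masked_ids))
        bottleneck = enc_hidden[:, :1]  # [CLS] vector used as the information bottleneck
        enc_loss = nn.functional.cross_entropy(
            self.lm_head(enc_hidden).transpose(1, 2), enc_labels, ignore_index=-100)
        # Decoder side: bottleneck vector prepended to clean-text embeddings whose
        # misspelled positions are [MASK]; recover those tokens.
        dec_in = torch.cat([bottleneck, self.embed(clean_masked_ids)[:, 1:]], dim=1)
        dec_hidden = self.decoder(dec_in)
        dec_loss = nn.functional.cross_entropy(
            self.lm_head(dec_hidden).transpose(1, 2), dec_labels, ignore_index=-100)
        return enc_loss + dec_loss
```

Because the decoder is shallow and sees the clean text only through masked embeddings, most of the information needed to reconstruct the misspelled tokens must flow through the bottleneck vector, which is what encourages the encoder to produce typo-robust representations for retrieval.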
