Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

Dense Retrieval (DR) has achieved state-of-the-art first-stage ranking effectiveness. However, the efficiency of most existing DR models is limited by the large memory cost of storing dense vectors and by the time-consuming nearest neighbor search (NNS) in vector space. We therefore present RepCONC, a novel retrieval model that learns discrete Representations via CONstrained Clustering. RepCONC jointly trains dual-encoders and the Product Quantization (PQ) method to learn discrete document representations, which enables fast approximate NNS with compact indexes. It models quantization as a constrained clustering process that requires the document embeddings to be uniformly clustered around the quantization centroids and that supports end-to-end optimization of the quantization method and the dual-encoders. We theoretically demonstrate the importance of the uniform clustering constraint in RepCONC and derive an efficient approximate solution for constrained clustering by reducing it to an instance of the optimal transport problem. Besides constrained clustering, RepCONC further adopts a vector-based inverted file system (IVF) to support highly efficient vector search on CPUs. Extensive experiments on two popular ad-hoc retrieval benchmarks show that RepCONC achieves better ranking effectiveness than competitive vector quantization baselines under different compression ratio settings. It also substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency.
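
To make the uniform clustering constraint concrete, the following is a minimal sketch of a balanced soft assignment of embeddings to centroids, posed as an entropy-regularized optimal transport problem and solved with Sinkhorn-style row and column rescaling. The choice of solver, the function names (`squared_distances`, `sinkhorn_balanced_assign`), the parameters `eps` and `n_iters`, and the toy one-dimensional data are all assumptions made for illustration; they are not taken from the text above and may differ from the approximate solution RepCONC actually derives.

```python
import math


def squared_distances(points, centroids):
    """Squared Euclidean distance from every point to every centroid."""
    return [
        [sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids]
        for point in points
    ]


def sinkhorn_balanced_assign(dist, eps=0.5, n_iters=200):
    """Soft assignment in which every centroid receives the same total mass.

    dist[i][j] is the cost of assigning point i to centroid j.
    The returned plan has rows summing to about 1/n and columns summing to 1/k.
    eps (regularization strength) and n_iters are illustrative defaults.
    """
    n, k = len(dist), len(dist[0])
    # Gibbs kernel; subtracting the row minimum only rescales each row,
    # which the row scaling vector u absorbs, and it avoids underflow.
    kernel = [[math.exp(-(d - min(row)) / eps) for d in row] for row in dist]
    v = [1.0] * k
    for _ in range(n_iters):
        # Rescale rows so each point carries mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Rescale columns so each centroid receives mass 1/k (the uniform constraint).
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Toy data (hypothetical): four points sit near the first centroid, two near the second.
points = [[0.0], [0.1], [0.2], [0.3], [0.9], [1.0]]
centroids = [[0.0], [1.0]]
dist = squared_distances(points, centroids)

# Unconstrained nearest-centroid assignment is skewed toward the first centroid.
nearest = [row.index(min(row)) for row in dist]
nearest_counts = [nearest.count(j) for j in range(len(centroids))]

# The constrained plan spreads the total mass evenly over the centroids.
plan = sinkhorn_balanced_assign(dist)
column_mass = [sum(row[j] for row in plan) for j in range(len(centroids))]

print("nearest-centroid counts:", nearest_counts)
print("balanced column mass:", [round(m, 3) for m in column_mass])
```

In this sketch the column update runs last, so the equal-mass constraint on the centroids holds up to floating-point error, while the per-point row sums hold only once the iterations have converged. A smaller `eps` pushes the plan closer to a hard assignment at the cost of slower convergence.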
