Interpreting Dense Retrieval as Mixture of Topics

Dense Retrieval (DR) achieves state-of-the-art results in first-stage retrieval, yet little is known about the mechanisms behind its success. In this work, we therefore conduct an interpretation study of recently proposed DR models. Specifically, we first discretize the embeddings produced by the document and query encoders, and then, based on these discrete representations, analyze the attribution of input tokens. We carry out both qualitative and quantitative experiments on public test collections. The results suggest that DR models attend to different aspects of the input and extract a variety of high-level topic representations; the representations learned by DR models can therefore be regarded as a mixture of high-level topics.
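To make the discretization step concrete, the sketch below shows one common way to turn continuous encoder outputs into discrete codes: product-quantization-style clustering, in which each embedding is split into sub-vectors and each sub-vector is replaced by the index of its nearest k-means centroid. This is a minimal sketch under assumed settings, not the paper's exact procedure; the function name, sub-vector count, and codebook size are all illustrative.

```python
# Minimal sketch: product-quantization-style discretization of dense
# embeddings via per-subspace k-means. All names and parameters here are
# illustrative assumptions, not the paper's actual configuration.
import numpy as np
from sklearn.cluster import KMeans

def discretize_embeddings(embeddings: np.ndarray,
                          num_subvectors: int = 8,
                          codebook_size: int = 256) -> np.ndarray:
    """Map (n, d) float embeddings to (n, num_subvectors) integer codes."""
    n, d = embeddings.shape
    assert d % num_subvectors == 0, "embedding dim must split evenly"
    sub_dim = d // num_subvectors
    codes = np.empty((n, num_subvectors), dtype=np.int32)
    for m in range(num_subvectors):
        # Cluster the m-th sub-vector slice; each cluster id is a discrete code.
        sub = embeddings[:, m * sub_dim:(m + 1) * sub_dim]
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(sub)
        codes[:, m] = km.labels_
    return codes
```

Each discrete code groups embeddings into a cluster that, under the paper's interpretation, can be read as a high-level topic.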
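For the attribution step, a standard choice is integrated gradients, which accumulates gradients of a relevance score along a straight-line path from a baseline to the actual token embeddings, so that each token receives a share of the score. The sketch below assumes a PyTorch setting and a hypothetical scalar scoring function `score_fn` (for example, the dot product between the encoded document and a query vector); it illustrates generic token attribution rather than the paper's exact setup.

```python
# Minimal sketch: integrated-gradients attribution over token embeddings.
# `score_fn` is a hypothetical callable mapping (seq_len, hidden) embeddings
# to a scalar relevance score; it stands in for "encoder + similarity".
import torch

def integrated_gradients(input_embeds: torch.Tensor, score_fn,
                         steps: int = 50) -> torch.Tensor:
    """Return one attribution score per token for a (seq_len, hidden) input."""
    baseline = torch.zeros_like(input_embeds)   # all-zero reference input
    total_grads = torch.zeros_like(input_embeds)
    for k in range(1, steps + 1):
        # Interpolate between the baseline and the real embeddings.
        point = baseline + (k / steps) * (input_embeds - baseline)
        point = point.clone().requires_grad_(True)
        score = score_fn(point)                 # scalar relevance score
        grad, = torch.autograd.grad(score, point)
        total_grads += grad
    avg_grads = total_grads / steps
    # Scale by the input-baseline difference and sum over the hidden dim
    # to obtain a single attribution value per token.
    return ((input_embeds - baseline) * avg_grads).sum(dim=-1)
```

Tokens with the largest attribution scores are the ones the encoder leans on most when forming its representation, offering one view of which aspects of the input the model attends to.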
