BISON: BM25-weighted Self-Attention Framework for Multi-Fields Document Search

Recent breakthroughs in natural language processing have advanced information retrieval from keyword matching to semantic vector search. To map queries and documents into semantic vectors, self-attention models are widely used. However, typical self-attention models such as the Transformer lack the prior knowledge needed to distinguish the importance of different tokens, which has been shown to play a critical role in information retrieval tasks. In addition, when WordPiece tokenization is applied, a rare word may be split into several tokens, so translating word-level prior knowledge into WordPiece tokens becomes a new challenge for semantic representation generation. Moreover, web documents usually have multiple fields, and due to the heterogeneity of these fields, a simple combination is not a good choice. In this paper, we propose a novel BM25-weighted Self-Attention framework (BISON) for web document search. By leveraging BM25 as prior weights, BISON learns weighted attention scores jointly with the query matrix Q and the key matrix K. We also present an efficient whole-word weight-sharing solution to mitigate the prior-knowledge discrepancy between words and WordPiece tokens. Furthermore, BISON effectively combines multiple fields by placing different fields into different segments. We demonstrate that BISON captures topical and semantic representations of both queries and documents more effectively. Intrinsic evaluation and experiments on public data sets show that BISON is a general framework for document ranking tasks: it outperforms BERT and other modern models while retaining the same model complexity as BERT.
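
The abstract only outlines the mechanism, so the following is a minimal sketch of the two core ideas: injecting per-token BM25 weights as a prior on self-attention scores, and sharing each whole word's BM25 weight across its WordPiece sub-tokens. It assumes the BM25 prior enters additively in log space alongside the scaled dot-product scores; the actual BISON formulation may differ, and the function names (bm25_weighted_attention, share_word_weights) and the toy inputs are illustrative, not from the paper.

```python
# Sketch of BM25-weighted self-attention with whole-word weight sharing.
# Assumption: the BM25 prior is added (in log space) to the usual scaled
# dot-product attention scores before the softmax. Not the paper's exact math.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def share_word_weights(word_weights, word_ids):
    """Map word-level BM25 weights onto WordPiece tokens.

    word_weights: BM25 weight per whole word, shape (num_words,)
    word_ids:     for each WordPiece token, the index of its source word
    """
    return np.asarray([word_weights[i] for i in word_ids])


def bm25_weighted_attention(X, Wq, Wk, Wv, token_bm25):
    """Scaled dot-product attention biased by a BM25 prior over key positions.

    X:          token embeddings, shape (seq_len, d_model)
    Wq, Wk, Wv: projection matrices, shape (d_model, d_head)
    token_bm25: BM25 weight per WordPiece token, shape (seq_len,)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # standard attention scores
    prior = np.log(token_bm25 + 1e-6)           # assumed additive log-prior
    weights = softmax(scores + prior[None, :])  # attend more to high-BM25 tokens
    return weights @ V


# Toy usage: 4 WordPiece tokens coming from 3 whole words
# (the last word is split into two sub-tokens that share one weight).
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
word_bm25 = np.array([1.2, 0.3, 2.5])           # one BM25 weight per whole word
word_ids = [0, 1, 2, 2]                         # token -> source-word index
token_bm25 = share_word_weights(word_bm25, word_ids)
out = bm25_weighted_attention(X, Wq, Wk, Wv, token_bm25)
print(out.shape)                                # (4, 8)
```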
