Document Retrieval Using Deep Learning

Document retrieval has seen significant advancements over the last few decades. Recent developments in Natural Language Processing (NLP) have made it possible to incorporate context and complex lexical patterns into document representations, opening new possibilities for advanced retrieval systems. Traditional approaches to indexing documents average word and sentence encodings to form fixed-length document embeddings. However, the common bag-of-words approach fails to capture semantic context, which can be critical for judging document-query relevance. We address this by leveraging Bidirectional Encoder Representations from Transformers (BERT) to create semantically rich document embeddings. BERT compensates for the limitations of Term Frequency-Inverse Document Frequency (TF-IDF) weighting by incorporating contextual embeddings. In this paper, we propose an ensemble of BERT and TF-IDF for document retrieval, in which the two models jointly score each document against a query to retrieve a final set of top-K documents. We compare our model against the standard TF-IDF baseline and demonstrate a significant performance improvement on the MS MARCO dataset (Microsoft-curated data of Bing queries).
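The ensemble described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it computes a simple TF-IDF relevance score per document, assumes BERT-based relevance scores are supplied precomputed (`bert_scores`), and combines the two signals with a hypothetical mixing weight `alpha` after min-max normalization.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum.

    Whitespace tokenization is a simplification; a real system would
    use proper preprocessing (e.g. NLTK, as is common in IR pipelines).
    """
    doc_tokens = [d.lower().split() for d in docs]
    q_tokens = query.lower().split()
    n = len(docs)
    # Smoothed inverse document frequency for each query term.
    idf = {t: math.log(n / (1 + sum(t in toks for toks in doc_tokens)))
           for t in set(q_tokens)}
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        scores.append(sum(tf[t] * idf[t] for t in q_tokens))
    return scores

def ensemble_rank(query, docs, bert_scores, alpha=0.5, k=3):
    """Blend TF-IDF and (precomputed) BERT scores, return top-k doc indices.

    `alpha` is an illustrative mixing weight, not a value from the paper.
    """
    def norm(xs):
        # Min-max normalize so the two score scales are comparable.
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    tfidf = norm(tfidf_scores(query, docs))
    bert = norm(bert_scores)
    combined = [alpha * t + (1 - alpha) * b for t, b in zip(tfidf, bert)]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return order[:k]
```

In practice the BERT scores would come from cosine similarity between a query embedding and document embeddings produced by a pretrained BERT encoder; they are stubbed out here to keep the sketch self-contained.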
