Leveraging Advantages of Interactive and Non-Interactive Models for Vector-Based Cross-Lingual Information Retrieval

Interactive and non-interactive models are the two de facto standard frameworks in vector-based cross-lingual information retrieval (V-CLIR), which encode queries and documents jointly and independently, respectively. In terms of retrieval accuracy and computational efficiency, each paradigm has its own strengths and weaknesses. In this paper, we propose a novel framework that leverages the advantages of both. Concretely, we introduce a semi-interactive mechanism, which builds our model on a non-interactive architecture but encodes each document together with its associated multilingual queries, so that cross-lingual features can be learned as in an interactive model. In addition, we transfer knowledge from a well-trained interactive model to ours by reusing its word embeddings and applying knowledge distillation. Our model is initialized from the multilingual pre-trained language model M-BERT, and evaluated on two open-resource CLIR datasets derived from Wikipedia and an in-house dataset collected from a real-world search engine. Extensive analyses reveal that our methods significantly boost retrieval accuracy while maintaining computational efficiency.
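The distillation step described above can be illustrated with a minimal sketch. Here a well-trained interactive (cross-encoder) teacher scores each query against a list of candidate documents, and the non-interactive student is trained to match the teacher's softened score distribution via a KL-divergence loss. The function names and the temperature value are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Convert raw relevance scores into a probability distribution.

    A higher temperature softens the distribution, which is the usual
    trick in knowledge distillation to expose the teacher's 'dark
    knowledge' about near-miss candidates.
    """
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """KL(teacher || student) over one query's candidate documents.

    `teacher_scores` come from the interactive model (query and document
    encoded jointly); `student_scores` come from the non-interactive
    model (dot products of independently encoded vectors). Minimizing
    this loss pushes the cheap student toward the accurate teacher.
    """
    p = softmax(teacher_scores, temperature)   # soft targets
    q = softmax(student_scores, temperature)   # student prediction
    eps = 1e-12                                # avoid log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

The loss is zero when the student reproduces the teacher's ranking distribution exactly and grows as the two distributions diverge; in practice it would be combined with a standard retrieval loss on labeled query–document pairs.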
