The 2019 BBN Cross-lingual Information Retrieval System

In this paper, we describe a cross-lingual information retrieval (CLIR) system that, given a query in English and a set of audio and text documents in a foreign language, returns a scored list of relevant documents and presents the findings in summary form in English. Foreign audio documents are first transcribed by a state-of-the-art pretrained multilingual speech recognition model that is fine-tuned to the target language. For text documents, we use multiple multilingual neural machine translation (MT) models to achieve good translation results, especially for low- and medium-resource languages. The processed documents and queries are then scored using a probabilistic CLIR model that makes use of translation probabilities from GIZA translation tables and scores from a Neural Network Lexical Translation Model (NNLTM). Additionally, advanced score normalization, combination, and thresholding schemes are employed to maximize the Average Query Weighted Value (AQWV) score. The CLIR output, together with multiple translation renderings, is selected and rendered into English snippets via a summarization model. Our turnkey system is language agnostic and can be quickly trained for a new low-resource language in a few days.
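The probabilistic CLIR scoring described above can be illustrated with a minimal sketch. Note this is an assumption-laden simplification, not the paper's actual implementation: it scores a foreign-language document against an English query by marginalizing over lexical translations, using a toy `translation_table` as a stand-in for GIZA-style translation probabilities and simple linear smoothing against collection statistics; the function and parameter names are hypothetical.

```python
import math

def clir_score(query_terms, doc_term_freqs, doc_len,
               translation_table, collection_probs, alpha=0.3):
    """Sketch of a probabilistic CLIR score (hypothetical, simplified).

    translation_table: English term -> {foreign term: P(foreign | English)},
                       a stand-in for GIZA-style lexical translation tables.
    collection_probs:  English term -> background collection probability,
                       used for smoothing.
    """
    score = 0.0
    for q in query_terms:
        # P(q | D) ~= sum over foreign terms f of P(f | q) * P(f | D)
        p_q_given_doc = 0.0
        for f, p_trans in translation_table.get(q, {}).items():
            p_f_in_doc = doc_term_freqs.get(f, 0) / doc_len
            p_q_given_doc += p_trans * p_f_in_doc
        # Linear smoothing with the collection model avoids log(0)
        # for query terms with no translated match in the document.
        p_q_background = collection_probs.get(q, 1e-9)
        score += math.log(alpha * p_q_given_doc
                          + (1 - alpha) * p_q_background)
    return score

# Toy usage: a document containing "casa" should outscore one that
# does not, for the English query term "house".
table = {"house": {"casa": 0.9, "hogar": 0.1}}
coll = {"house": 0.001}
relevant = clir_score(["house"], {"casa": 2, "perro": 1}, 10, table, coll)
irrelevant = clir_score(["house"], {"perro": 3}, 10, table, coll)
```

In a full system, these per-query scores would then pass through the normalization, combination, and thresholding stages mentioned in the abstract before AQWV is computed.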
