Fast Derivation of Cross-lingual Document Vectors from Self-attentive Neural Machine Translation Model

A universal cross-lingual representation of documents, which captures the underlying semantics, is very useful in many natural language processing tasks. In this paper, we develop a new document vectorization method that uses a self-attention mechanism over a neural machine translation (NMT) model to select the most salient sequential patterns from the input and form document vectors. The model used by our method can be trained with parallel corpora that are unrelated to the task at hand. At test time, our method takes a monolingual document and converts it into a “Neural machine Translation framework based cross-lingual Document Vector” (NTDV). NTDV has two comparative advantages. First, an NTDV is produced by a single forward pass of the NMT encoder, so the process is very fast and requires no additional training or optimization. Second, our model can be conveniently adapted from a pair of existing attention-based NMT models, which significantly reduces the amount of parallel corpus needed for training. In a cross-lingual document classification task, our NTDV embeddings surpass the previous state-of-the-art performance in the English-to-German classification test and, to the best of our knowledge, achieve the best performance among fast decoding methods in the German-to-English classification test.
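To make the idea concrete, below is a minimal sketch of how self-attentive pooling over an NMT encoder's hidden states can yield a fixed-size document vector in a single forward pass. The function names, weight shapes, and toy dimensions here are illustrative assumptions, not the authors' exact implementation; the attention form follows the general structured self-attention scheme (a softmax over r hops of tanh-projected encoder states).

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_doc_vector(H, W1, w2):
    """
    H  : (T, d)   encoder hidden states for the T source tokens (one NMT forward pass)
    W1 : (d_a, d) first attention projection
    w2 : (r, d_a) second projection giving r attention hops
    Returns a fixed-size document vector of shape (r * d,).
    """
    # A = softmax(w2 . tanh(W1 . H^T))  ->  (r, T) attention weights over tokens
    A = softmax(w2 @ np.tanh(W1 @ H.T), axis=-1)
    # M = A . H  ->  (r, d) attention-weighted sums of encoder states
    M = A @ H
    # Flatten the r pooled vectors into a single document vector
    return M.reshape(-1)

# Toy usage: random states stand in for a real encoder's output
rng = np.random.default_rng(0)
T, d, d_a, r = 12, 8, 6, 3        # toy sizes; real models use hundreds of dimensions
H = rng.normal(size=(T, d))
W1 = rng.normal(size=(d_a, d))
w2 = rng.normal(size=(r, d_a))
doc_vec = self_attentive_doc_vector(H, W1, w2)
print(doc_vec.shape)              # (24,) = r * d

Since the pooling reuses the encoder states that the NMT model already computes, deriving the vector adds only the cost of the two small matrix products above, which is why no per-document optimization is needed.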
