Translation Invariant Word Embeddings

This work focuses on the task of finding latent vector representations of the words in a corpus. In particular, we address the setting in which the corpus contains text in multiple languages. Prior work has, among other techniques, used canonical correlation analysis to project pre-trained vectors for two languages into a common space. We propose a simple and scalable method inspired by the notion that the learned vector representations should be invariant to translation between languages. We show empirically that our method outperforms prior work on multilingual tasks, matches its performance on monolingual tasks, and scales linearly with the size of the input data (and thus with the number of languages being embedded).
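
As a concrete illustration of the prior approach mentioned above (not of our method), the following minimal sketch shows how pre-trained embeddings for two languages might be projected into a common space with canonical correlation analysis. The embedding matrices, the bilingual dictionary, and all dimensions are hypothetical placeholders, and the sketch assumes scikit-learn's CCA implementation.

```python
# Hedged sketch of the CCA-based projection described as prior work above.
# All data here is synthetic; emb_en, emb_de, and `dictionary` stand in for
# real pre-trained embeddings and a real bilingual dictionary.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Pre-trained monolingual embeddings: one row per word, 300 dimensions each.
emb_en = rng.standard_normal((5000, 300))
emb_de = rng.standard_normal((5000, 300))

# Bilingual dictionary as (English row index, German row index) pairs.
dictionary = [(i, i) for i in range(1000)]  # placeholder alignment

# Stack the dictionary-aligned vectors so row i of X translates to row i of Y.
X = emb_en[[en for en, _ in dictionary]]
Y = emb_de[[de for _, de in dictionary]]

# CCA learns a linear map for each language that maximizes the correlation
# between the projections of translation pairs.
cca = CCA(n_components=40)
cca.fit(X, Y)

# Project the full vocabularies of both languages into the shared 40-dim space.
en_common, de_common = cca.transform(emb_en, emb_de)

# In the common space, a word and its translation should land close together,
# so cross-lingual similarity can be measured with ordinary cosine similarity.
```

This pairwise setup is what the abstract contrasts against: it requires a separate projection for each language pair, whereas the proposed method is designed to scale linearly in the size of the input data across many languages.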