Paper Information - Cross-Lingual Word Embeddings

Cross-Lingual Word Embeddings

The representation of words across languages has been of interest since the early days of interlingual machine translation, as it makes it possible to connect the meaning of words in different languages and to generalize lexical semantic properties and relations across languages (Hutchins 2000). Structured representations such as multilingual lexical knowledge bases represent polysemy as well as language-internal and cross-lingual relations, but they require costly manual construction and maintenance (Vossen 1998). Alternatively, corpus-based methods have been used with great success to automatically induce monolingual word representations such as word embeddings (Mikolov et al. 2013). Word embeddings represent the words in the vocabulary of a language as vectors in an n-dimensional space, where similar words are located close to each other. Cross-lingual word embeddings (CLWE for short) extend the idea and represent translation-equivalent words from two (or more) languages close to each other in a common, cross-lingual space.

Interest in cross-lingual word embeddings has grown in recent years. This is partly due to their success in cross-lingual transfer, where NLP tools trained in a resource-rich language such as English are transferred to another language with less or no annotated data. For instance, given training data for a text-classification task in English, a model using CLWE can classify foreign-language documents. Beyond language pairs, CLWE make it possible to represent words of several languages in a common space, and thus pave the way for multilingual NLP tools that use the same model to process text in different languages.

This comprehensive and, at the same time, dense book was written by Anders Søgaard, Ivan Vulić, Sebastian Ruder, and Manaal Faruqui. It covers all key issues as well as the most relevant work in CLWE, including the most recent research in this vibrant research area up to May 2019.
It does a great job of organizing the different approaches into a typology according to the kind of bilingual resources needed, differentiating word-level, sentence-level, and document-level models. The book also covers extensions to CLWE that are able to represent multiple languages in the same space, as well as

Eneko Agirre

[1] W. John Hutchins. Early years in machine translation: memoirs and biographies of pioneers, 2000.

[2] Piek Vossen, et al. EuroWordNet: A multilingual database with lexical semantic networks, 1998, Springer Netherlands.

[3] Anders Søgaard, et al. A Survey of Cross-lingual Word Embedding Models, 2017, J. Artif. Intell. Res.

[4] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.