Learning Cross-Lingual Word Embeddings from Twitter via Distant Supervision

Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that it is possible to construct such representations by aligning independently learned monolingual embedding spaces, and that accurate alignments can be obtained even without external bilingual data. In this paper we explore a research direction that has been surprisingly neglected in the literature: leveraging noisy user-generated text to learn cross-lingual embeddings tailored to social media applications. While the noisy and informal nature of the social media genre poses additional challenges for cross-lingual embedding methods, we find that it also provides key opportunities, owing to the abundance of code-switching and the existence of a shared vocabulary of emoji and named entities. Our contribution consists of a very simple post-processing step that exploits these phenomena to significantly improve the performance of state-of-the-art alignment methods.
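To make the idea of aligning independently learned monolingual spaces more concrete, the following is a minimal sketch of a generic orthogonal (Procrustes) mapping that uses tokens appearing verbatim in both vocabularies, such as emoji and named entities, as anchor pairs. It is not the post-processing step proposed in this paper. The toy vocabularies, the synthetic embeddings, and the names `procrustes_align` and `shared_anchor_rows` are assumptions made purely for illustration.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def shared_anchor_rows(src_vocab, tgt_vocab):
    """Row indices of tokens that appear verbatim in both vocabularies."""
    shared = sorted(set(src_vocab) & set(tgt_vocab))
    src_idx = [src_vocab.index(tok) for tok in shared]
    tgt_idx = [tgt_vocab.index(tok) for tok in shared]
    return src_idx, tgt_idx


# Hypothetical toy vocabularies: emoji and named entities are shared,
# ordinary words are not.
src_vocab = ["hola", "😂", "❤", "messi", "gracias", "🔥", "twitter"]
tgt_vocab = ["hello", "😂", "❤", "messi", "thanks", "🔥", "twitter"]

# Synthetic embeddings: the target space is an exact rotation Q of the
# source space. Real monolingual spaces are only approximately related.
rng = np.random.default_rng(0)
dim = 4
src_emb = rng.normal(size=(len(src_vocab), dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_emb = src_emb @ Q

# Learn the mapping from the shared tokens only (no bilingual dictionary).
src_idx, tgt_idx = shared_anchor_rows(src_vocab, tgt_vocab)
W = procrustes_align(src_emb[src_idx], tgt_emb[tgt_idx])

# Map a non-shared source word and look up its nearest target neighbour
# by cosine similarity.
query = src_emb[src_vocab.index("hola")] @ W
sims = (tgt_emb @ query) / (
    np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(query)
)
nearest = tgt_vocab[int(np.argmax(sims))]
print(nearest)
```

Because the toy target space is constructed as an exact rotation of the source space, the anchors recover that rotation here. On real Twitter embeddings the relation is only approximate, so the shared tokens would give a noisy starting point at best.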
