Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings

This paper investigates the use of unsupervised cross-lingual embeddings for solving the problem of code-mixed social media text understanding. We specifically investigate the use of these embeddings for a sentiment analysis task for Hinglish Tweets, viz. English combined with (transliterated) Hindi. In a first step, baseline models, initialized with monolingual embeddings obtained from large collections of tweets in English and code-mixed Hinglish, were trained. In a second step, two systems using cross-lingual embeddings were researched, being (1) a supervised classifier and (2) a transfer learning approach trained on English sentiment data and evaluated on code-mixed data. We demonstrate that incorporating cross-lingual embeddings improves the results (F1-score of 0.635 versus a monolingual baseline of 0.616), without any parallel data required to train the cross-lingual embeddings. In addition, the results show that the cross-lingual embeddings not only improve the results in a fully supervised setting, but they can also be used as a base for distant supervision, by training a sentiment model in one of the source languages and evaluating on the other language projected in the same space. The transfer learning experiments result in an F1-score of 0.556, which is almost on par with the supervised settings and speak to the robustness of the cross-lingual embeddings approach.

[1]  Dipti Misra Sharma,et al.  Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text , 2016, NAACL.

[2]  Eneko Agirre,et al.  A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings , 2018, ACL.

[3]  Vasudeva Varma,et al.  Hindi Subjective Lexicon: A Lexical Resource for Hindi Adjective Polarity Classification , 2012, LREC.

[4]  Claire Cardie,et al.  Unsupervised Multilingual Word Embeddings , 2018, EMNLP.

[5]  Dong Nguyen,et al.  Word Level Language Identification in Online Multilingual Communication , 2013, EMNLP.

[6]  Manish Shrivastava,et al.  Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text , 2016, COLING.

[7]  Guillaume Lample,et al.  XNLI: Evaluating Cross-lingual Sentence Representations , 2018, EMNLP.

[8]  Guillaume Lample,et al.  Word Translation Without Parallel Data , 2017, ICLR.

[9]  Yang Liu,et al.  Part-of-Speech Tagging for English-Spanish Code-Switched Text , 2008, EMNLP.

[10]  Claire Cardie,et al.  Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification , 2016, TACL.

[11]  Rana D. Parshad,et al.  What is India speaking? Exploring the “Hinglish” invasion , 2016 .

[12]  Xiaojun Wan,et al.  Cross-Lingual Sentiment Classification with Bilingual Document Representation Learning , 2016, ACL.

[13]  Sivaji Bandyopadhyay,et al.  SentiWordNet for Indian Languages , 2010 .

[14]  Meng Zhang,et al.  Earth Mover’s Distance Minimization for Unsupervised Bilingual Lexicon Induction , 2017, EMNLP.

[15]  Pushpak Bhattacharyya,et al.  A Fall-back Strategy for Sentiment Analysis in Hindi: a Case Study , 2010 .

[16]  Preslav Nakov,et al.  SemEval-2016 Task 4: Sentiment Analysis in Twitter. , 2019 .

[17]  Manaal Faruqui,et al.  Cross-lingual Models of Word Embeddings: An Empirical Comparison , 2016, ACL.

[18]  Christopher D. Manning,et al.  Bilingual Word Representations with Monolingual Quality in Mind , 2015, VS@HLT-NAACL.

[19]  Phil Blunsom,et al.  Multilingual Models for Compositional Distributed Semantics , 2014, ACL.

[20]  A G N,et al.  Bibliographical References , 1965 .

[21]  Guillaume Lample,et al.  Cross-lingual Language Model Pretraining , 2019, NeurIPS.

[22]  Johannes Bjerva,et al.  Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations , 2017, NODALIDA.

[23]  Monojit Choudhury,et al.  Word Embeddings for Code-Mixed Language Processing , 2018, EMNLP.

[24]  Jatin Sharma,et al.  POS Tagging of English-Hindi Code-Mixed Social Media Content , 2014, EMNLP.

[25]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[26]  Manaal Faruqui,et al.  Improving Vector Space Word Representations Using Multilingual Correlation , 2014, EACL.

[27]  Els Lefever,et al.  A Classification-Based Approach to Cognate Detection Combining Orthographic and Semantic Similarity Information , 2019, RANLP.

[28]  Monojit Choudhury,et al.  Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique , 2017, ACL.