Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings

This paper investigates the use of unsupervised cross-lingual embeddings for understanding code-mixed social media text. Specifically, we apply these embeddings to a sentiment analysis task on Hinglish tweets, i.e., English combined with (transliterated) Hindi. In a first step, we trained baseline models initialized with monolingual embeddings obtained from large collections of tweets in English and code-mixed Hinglish. In a second step, we investigated two systems that use cross-lingual embeddings: (1) a supervised classifier and (2) a transfer learning approach trained on English sentiment data and evaluated on code-mixed data. We demonstrate that incorporating cross-lingual embeddings improves the results (F1-score of 0.635 versus a monolingual baseline of 0.616), without requiring any parallel data to train the cross-lingual embeddings. In addition, the results show that the cross-lingual embeddings not only improve performance in a fully supervised setting but can also serve as a basis for distant supervision: a sentiment model is trained in one of the source languages and evaluated on the other language projected into the same space. The transfer learning experiments yield an F1-score of 0.556, which is almost on par with the supervised settings and speaks to the robustness of the cross-lingual embeddings approach.
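
The transfer learning setup described in the abstract (train on English, evaluate on code-mixed text projected into the same space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2-D vectors in `SHARED_EMBEDDINGS` are invented and stand in for embeddings that have already been mapped into a shared space, the nearest-centroid classifier is a stand-in because the abstract does not specify the model, and macro-averaging is only one possible reading of "F1-score".

```python
from collections import defaultdict

# Hypothetical 2-D vectors standing in for embeddings already mapped into one
# shared English/Hinglish space. The values are invented for illustration.
SHARED_EMBEDDINGS = {
    # English tokens
    "good": (0.9, 0.1), "great": (0.8, 0.2), "love": (0.85, 0.15),
    "bad": (0.1, 0.9), "awful": (0.2, 0.8), "hate": (0.15, 0.85),
    # Transliterated Hindi tokens
    "accha": (0.88, 0.12), "badhiya": (0.82, 0.18),
    "bura": (0.12, 0.88), "bekar": (0.18, 0.82),
}


def embed(text):
    """Average the vectors of known tokens; return None if none are known."""
    vectors = [SHARED_EMBEDDINGS[t] for t in text.lower().split()
               if t in SHARED_EMBEDDINGS]
    if not vectors:
        return None
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train_centroids(examples):
    """Compute one mean vector per label from (text, label) pairs."""
    grouped = defaultdict(list)
    for text, label in examples:
        vec = embed(text)
        if vec is not None:
            grouped[label].append(vec)
    return {label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
            for label, vecs in grouped.items()}


def predict(centroids, text):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    vec = embed(text)
    if vec is None:
        return None
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(vec, centroids[lab])))


def macro_f1(gold, predicted):
    """Unweighted mean of the per-label F1 scores."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Train on English only, then evaluate on code-mixed text (toy examples).
english_train = [("great movie love it", "pos"), ("good", "pos"),
                 ("awful movie hate it", "neg"), ("bad", "neg")]
hinglish_test = [("movie bahut accha tha", "pos"), ("yeh film bekar hai", "neg"),
                 ("badhiya song", "pos"), ("bura din", "neg")]

centroids = train_centroids(english_train)
gold = [label for _, label in hinglish_test]
predicted = [predict(centroids, text) for text, _ in hinglish_test]
print(macro_f1(gold, predicted))
```

The classifier never sees a Hinglish training example. It can label the code-mixed test sentences only because the transliterated Hindi tokens sit near their English counterparts in the shared space, which is the property the cross-lingual mapping is meant to provide. The toy vectors are constructed to make this work, so the score printed here says nothing about the results reported in the paper.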

Els Lefever | Pranaydeep Singh
