A Dataset for Detecting Irony in Hindi-English Code-Mixed Social Media Text

Irony is one of many forms of figurative language. Irony detection is crucial for Natural Language Processing (NLP) tasks such as sentiment analysis and opinion mining. From a cognitive point of view, it is a challenge to study how humans use irony as a communication tool. While relevant research has been done independently on code-mixed social media text and on irony detection, our work is the first attempt at detecting irony in Hindi-English code-mixed social media text. In this paper, we treat automatic irony detection as a classification problem and present a Hindi-English code-mixed dataset consisting of tweets posted on Twitter. Each tweet is annotated with the language of every word and with the class it belongs to (Ironic or Non-Ironic). We also propose a supervised classification system for detecting irony in such text using various character-level, word-level, and structural features.
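To make the three feature families concrete, the following is a minimal sketch of how character-level, word-level, and structural features might be extracted from one tweet annotated with word-level language tags. It is an illustration under stated assumptions, not the feature set used in this paper: the example tweet, the tag names (`hi`, `en`, `other`), the n-gram sizes, and the particular structural cues (punctuation counts, number of language switches) are all hypothetical choices.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def word_ngrams(tokens, n=1):
    """Count word n-grams over a list of tokens (lowercased)."""
    toks = [t.lower() for t in tokens]
    return Counter(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))


def structural_features(text, lang_tags):
    """Surface cues; which cues to use is an assumption of this sketch."""
    # Only transitions between Hindi and English words count as switches;
    # tokens tagged "other" (punctuation, emoticons, ...) are skipped.
    langs = [t for t in lang_tags if t in ("hi", "en")]
    switches = sum(1 for a, b in zip(langs, langs[1:]) if a != b)
    return {
        "num_tokens": len(lang_tags),
        "num_exclaim": text.count("!"),
        "num_question": text.count("?"),
        "num_language_switches": switches,
    }


def extract_features(tokens, lang_tags):
    """Merge all three feature families into one sparse feature dict."""
    text = " ".join(tokens)
    feats = {}
    for gram, count in char_ngrams(text).items():
        feats["char3=" + gram] = count
    for gram, count in word_ngrams(tokens).items():
        feats["word1=" + gram] = count
    for name, value in structural_features(text, lang_tags).items():
        feats["struct=" + name] = value
    return feats


# Made-up tweet with hypothetical word-level language tags.
tokens = ["Wah", "kya", "service", "hai", "!", "Great", "job", "!"]
lang_tags = ["hi", "hi", "en", "hi", "other", "en", "en", "other"]
features = extract_features(tokens, lang_tags)
```

A feature dictionary of this shape can be vectorized and passed to any standard supervised classifier, with the Ironic or Non-Ironic label of each tweet as the target.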
