POS Tagging of English-Hindi Code-Mixed Social Media Content

Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration, and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, and explore language identification, back-transliteration, normalization, and POS tagging of this data. Our results show that language identification and transliteration for Hindi are two major challenges that impact POS tagging accuracy.
