POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments

We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media. We propose extensions to existing approaches and present a new feature set that addresses the transliteration problem inherent in social media text. With the new feature set, we achieve 84% accuracy. We also show that context and joint modelling of the language detection and POS tag layers do not help POS tagging.
