Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text

Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context because of phenomena such as utterance-internal code-switching, lexical borrowing, and phonetic typing. These phenomena all imply that language identification in social media has to be carried out at the word level. This paper reports a study on detecting language boundaries at the word level in chat message corpora of mixed English-Bengali and English-Hindi. We introduce a code-mixing index to evaluate the level of blending in the corpora and describe the performance of a system developed to separate multiple languages.
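
The abstract does not state how the code-mixing index is computed. As a minimal sketch, assume one plausible formulation: for an utterance with word-level language tags, the index is 100 × (1 − dominant-language tokens / language-bearing tokens), and it is 0 when no token carries a language. The function name `code_mixing_index`, the tag labels (`en`, `bn`, `univ`), and the `neutral_tag` parameter are illustrative assumptions, not names taken from the paper.

```python
from collections import Counter


def code_mixing_index(tags, neutral_tag="univ"):
    """Return an assumed code-mixing score in [0, 100) for one utterance.

    `tags` holds one language label per token. Tokens labelled with
    `neutral_tag` (e.g. punctuation, emoticons, numbers) are treated as
    language-independent and ignored. This formulation is an assumption
    made for illustration.
    """
    # Count only tokens that carry a language label.
    lang_counts = Counter(t for t in tags if t != neutral_tag)
    lang_tokens = sum(lang_counts.values())

    # No language-bearing tokens: nothing can be mixed.
    if lang_tokens == 0:
        return 0.0

    # Share of tokens that fall outside the dominant language.
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / lang_tokens)


# Hypothetical word-level tags for a six-token chat message.
example_tags = ["en", "en", "bn", "bn", "bn", "univ"]
score = code_mixing_index(example_tags)
```

Under this assumption a monolingual utterance scores 0, and the score rises as tokens are spread more evenly across languages.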
