Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora

Abstract

This article addresses word-level language identification in Indian social media corpora drawn from Facebook, Twitter and WhatsApp posts that exhibit code-mixing between English and Hindi, between English and Bengali, as well as a blend of both language pairs. Code-mixing is a fusion of multiple languages that was previously associated mainly with spoken language, but which social media users also employ in their typically casual written communication. The noisy nature of code-mixed social media text makes language identification challenging. Here, the performance of deep learning on this task is compared to that of feature-based learning: two recurrent neural network techniques, Long Short-Term Memory (LSTM) and bidirectional LSTM, are contrasted with a Conditional Random Fields (CRF) classifier. The results show the deep learners outscoring the CRF, with the bidirectional LSTM achieving the best language identification performance.
