Comparing the Level of Code-Switching in Corpora

Social media texts are often fairly informal and conversational and, when produced by bilinguals, tend to be written in several languages at once, much as conversational speech is. The recent availability of large social media corpora has thus also made large-scale code-switched resources available for research. This paper addresses the issues of evaluation and comparison that these new corpora entail by defining an objective measure of the corpus-level complexity of code-switched texts. It also shows how this formal measure can be used in practice by applying it to several code-switched corpora.
