Detecting Code-Switching between Turkish-English Language Pair

Code-switching, the alternating use of different languages within a single conversation, is increasingly common in social media and colloquial usage, and it poses distinct challenges for natural language processing. This paper presents the first study on detecting Turkish-English code-switching and introduces a small test data set collected from social media to facilitate further studies. The proposed system, which uses character-level n-grams and conditional random fields (CRFs), obtains a 95.6% micro-averaged F1-score on the introduced test data set.
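To make the two technical ingredients named in the abstract more concrete, the following is a minimal sketch of how character-level n-gram features for a token might be built and how a micro-averaged F1-score can be computed over token labels. It is not the authors' implementation: the function names (`char_ngrams`, `token_features`, `micro_f1`), the `^`/`$` padding symbols, the n-gram range, the context features, and the `TR`/`EN` labels are all assumptions made for illustration. The feature dictionaries have the general shape that common CRF toolkits accept, but no CRF is trained here.

```python
from typing import Dict, List


def char_ngrams(token: str, n_min: int = 1, n_max: int = 3) -> List[str]:
    """Return the character n-grams of a lowercased token padded with ^ and $.

    Note: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı");
    this is a simplification in this sketch.
    """
    padded = f"^{token.lower()}$"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def token_features(tokens: List[str], i: int) -> Dict[str, float]:
    """Build a hypothetical feature dict for the token at position i."""
    feats = {"bias": 1.0}
    # Character bigrams and trigrams of the current token.
    for gram in char_ngrams(tokens[i], 2, 3):
        feats[f"ng={gram}"] = 1.0
    # Neighbouring tokens give the sequence model some context.
    if i > 0:
        feats[f"prev={tokens[i - 1].lower()}"] = 1.0
    else:
        feats["BOS"] = 1.0
    if i < len(tokens) - 1:
        feats[f"next={tokens[i + 1].lower()}"] = 1.0
    else:
        feats["EOS"] = 1.0
    return feats


def micro_f1(gold: List[str], pred: List[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif g == label and p != label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ["bugün", "meeting", "var"]  # illustrative mixed sentence
    features = [token_features(sentence, i) for i in range(len(sentence))]
    print(sorted(features[1]))
    print(micro_f1(["TR", "EN", "TR", "TR"], ["TR", "EN", "EN", "TR"]))
```

When every token receives exactly one label and all labels are pooled, as in this sketch, the micro-averaged F1 coincides with token-level accuracy. The evaluation protocol behind the reported 95.6% may differ in which labels it includes.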
