A Survey of Code-switched Speech and Language Processing

Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon in multilingual communities across the world. This survey reviews computational approaches to code-switched Speech and Natural Language Processing. We explain why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. Because code-switching data and resources are scarce, we list what is available for various code-switched language pairs, along with the language processing tasks each resource can be used for. We review code-switching research in various Speech and NLP applications, including language processing tools and end-to-end systems. We conclude with future directions and open problems in the field.
