A Survey of Code-switched Speech and Language Processing

Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon in multilingual communities across the world. This survey reviews computational approaches to code-switched Speech and Natural Language Processing. We motivate why processing code-switched text and speech is essential for building intelligent agents and systems that interact with users in multilingual communities. Because code-switching data and resources are scarce, we list what is available for various code-switched language pairs, together with the language processing tasks each resource can be used for. We review code-switching research in various Speech and NLP applications, including language processing tools and end-to-end systems. We conclude with future directions and open problems in the field.
