Hindi-English Code-Switching Speech Corpus

Code-switching refers to the use of two languages within a single sentence or discourse. It is a global phenomenon among multilingual communities and has emerged as an independent area of research. With the increasing demand for code-switching automatic speech recognition (ASR) systems, the development of code-switching speech corpora has become highly desirable. However, very limited code-switched resources are currently available for training such systems. In this work, we present our first efforts toward building a code-switching ASR system in the Indian context. For that purpose, we have created a Hindi-English code-switching speech database. The database not only contains speech utterances with code-switching properties but also covers session and speaker variations such as pronunciation, accent, age, and gender. This database can be applied in several speech signal processing applications, such as code-switching ASR, language identification, language modeling, and speech synthesis. This paper mainly presents an analysis of the statistics of the collected code-switching speech corpus. It then reports ASR performance results on the created database.
