The French-Algerian Code-Switching Triggered audio corpus (FACST)

The French-Algerian Code-Switching Triggered corpus (FACST) was created to support a variety of studies in phonetics, prosody, and natural language processing. The first aim of the FACST corpus is to collect spontaneous code-switching (CS) speech. To obtain a large quantity of spontaneous CS utterances in natural conversations, experiments were carried out on how to elicit CS. A triggering protocol based on code-switched questions was found to be effective in eliciting CS in the responses. To ensure good audio quality, all recordings were made in a soundproof room or in a very quiet room. This paper describes the FACST corpus, along with the principal steps in building a French-Algerian CS speech corpus, including data collection. We also explain the selection criteria for the CS speakers and the recording protocols used. We present the methods used for data segmentation and annotation, and propose a transcription convention for this type of speech in each language, designed to suit both computational linguistic and acoustic-phonetic studies. We provide a quantitative description of the FACST corpus along with results of linguistic studies, and discuss some of the challenges we faced in collecting CS data.
