Paper information - Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus

Speech corpora are key resources for both linguists (in language analysis, research, and language teaching) and Natural Language Processing (NLP) researchers (in training and evaluating systems for tasks such as speech recognition, text-to-speech and speech-to-text synthesis). Despite the great demand, there is still a severe shortage of available corpora, especially for dialectal languages and code-switched speech. In this paper, we present our efforts in collecting and analyzing a speech corpus for conversational Egyptian Arabic. As in other multilingual societies, it is common among Egyptians to mix Arabic and English in daily conversations. The act of switching languages, at sentence boundaries or within the same sentence, is referred to as code-switching. The aim of this work is threefold: (1) gather spontaneous conversational Egyptian Arabic speech, (2) obtain manual transcriptions, and (3) analyze the speech from the code-switching perspective. A subset of the transcriptions was manually annotated with part-of-speech (POS) tags. We analyzed the POS distribution of the embedded words as well as that of the trigger words (Arabic words preceding a code-switching point). The speech corpus can be obtained by contacting the authors.
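The POS analysis described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical token format of (word, language, POS tag) and counts the POS tags of embedded English words and of trigger words, taken here as Arabic words immediately followed by an English word. The function name, the language labels, the tag set, and the sample sentence are all invented for illustration and are not taken from the corpus or the authors' tooling.

```python
from collections import Counter

# Hypothetical token format: (word, language, POS tag). The labels "ar"/"en"
# and the POS tags are illustrative assumptions, not the corpus's scheme.
def pos_distributions(sentences):
    embedded = Counter()  # POS tags of embedded English words
    triggers = Counter()  # POS tags of Arabic words right before a switch
    for tokens in sentences:
        for i, (word, lang, pos) in enumerate(tokens):
            if lang == "en":
                embedded[pos] += 1
            # A trigger word is an Arabic token directly followed by English.
            has_next = i + 1 < len(tokens)
            if lang == "ar" and has_next and tokens[i + 1][1] == "en":
                triggers[pos] += 1
    return embedded, triggers

# Toy sentence made up for this example (not corpus data).
sample = [[
    ("ana", "ar", "PRON"),
    ("3ayez", "ar", "VERB"),
    ("a3mel", "ar", "VERB"),
    ("meeting", "en", "NOUN"),
    ("bokra", "ar", "ADV"),
]]

embedded, triggers = pos_distributions(sample)
print(dict(embedded))
print(dict(triggers))
```

In this toy sentence the only embedded word is a noun and the only trigger word is the verb preceding it; on real annotated transcriptions the same counts would give the two POS distributions the abstract refers to.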
