Speech corpora are key resources for both linguists (in language analysis, research and language teaching) and Natural Language Processing (NLP) researchers (in training and evaluating systems for tasks such as speech recognition, text-to-speech synthesis and speech-to-text conversion). Despite the great demand, there is still a severe shortage of available corpora, especially for dialectal languages and code-switched speech. In this paper, we present our efforts in collecting and analyzing a speech corpus for conversational Egyptian Arabic. As in other multilingual societies, it is common among Egyptians to use a mix of Arabic and English in daily conversations. The act of switching languages, at sentence boundaries or within the same sentence, is referred to as code-switching. The aim of this work is threefold: (1) gather spontaneous conversational Egyptian Arabic speech, (2) obtain manual transcriptions and (3) analyze the speech from the code-switching perspective. A subset of the transcriptions was manually annotated with part-of-speech (POS) tags. The POS distribution of the embedded words was analyzed, as well as the POS distribution of the trigger words (Arabic words preceding a code-switching point). The speech corpus can be obtained by contacting the authors.
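The POS analysis of embedded and trigger words can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes transcriptions are available as lists of (word, POS tag) pairs, that embedded words are those written in Latin script, and that a trigger word is the Arabic-script word immediately preceding an embedded word. The function names, the script-based language test and the tag labels are all hypothetical.

```python
from collections import Counter


def is_arabic(token):
    """Assumption: a token is Arabic if it contains a character
    from the basic Arabic Unicode block (U+0600 to U+06FF)."""
    return any("\u0600" <= ch <= "\u06FF" for ch in token)


def is_embedded(token):
    """Assumption: a token is an embedded (English) word if it
    contains at least one ASCII letter."""
    return any(ch.isascii() and ch.isalpha() for ch in token)


def pos_distributions(tagged_tokens):
    """Count POS tags of embedded words and of trigger words.

    tagged_tokens: list of (word, pos) pairs for one utterance.
    A trigger word is taken to be an Arabic word directly followed
    by an embedded word, i.e. the word before a switch point.
    """
    embedded = Counter()
    triggers = Counter()
    for i, (word, pos) in enumerate(tagged_tokens):
        if not is_embedded(word):
            continue
        embedded[pos] += 1
        if i > 0:
            prev_word, prev_pos = tagged_tokens[i - 1]
            if is_arabic(prev_word):
                triggers[prev_pos] += 1
    return embedded, triggers


# Hypothetical utterance with illustrative tag labels.
sample = [("أنا", "PRON"), ("عندي", "PREP"), ("meeting", "NOUN"), ("بكرة", "ADV")]
embedded, triggers = pos_distributions(sample)
print(dict(embedded))  # {'NOUN': 1}
print(dict(triggers))  # {'PREP': 1}
```

A script-based test is only a rough proxy for language identity; English words transcribed in Arabic script, or Arabic words in Latin script, would need language labels from the annotation instead.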