ASR corpus design for resource-scarce languages

We investigate the number of speakers and the amount of data required to develop usable speaker-independent speech-recognition systems in resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for acceptable phone-based recognition.

Index Terms: speech recognition, corpus design
