Speed Perturbation and Vowel Duration Modeling for ASR in Hausa and Wolof Languages

Automatic Speech Recognition (ASR) for under-resourced Sub-Saharan African languages faces several challenges: small amounts of transcribed speech, written-language normalization issues, few text resources for language modeling, and language-specific features (tones, morphology, etc.) that must be taken into account to optimize ASR performance. This paper addresses some of these challenges through the development of ASR systems for two Sub-Saharan African languages: Hausa and Wolof. First, we investigate a data augmentation technique (speed perturbation) to compensate for the lack of resources. Second, and as our main contribution, we attempt to model the vowel length contrast that exists in both languages. For reproducibility, the ASR systems developed for Hausa and Wolof are made available to the research community on GitHub. To our knowledge, the Wolof ASR system presented in this paper is the first large vocabulary continuous speech recognition system developed for this language.
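
To illustrate the general idea behind speed perturbation, the following is a minimal sketch, not the system described in this paper. It assumes the common recipe of resampling each training waveform at a few speed factors; the factors 0.9, 1.0 and 1.1, the linear-interpolation resampler, the function name `speed_perturb`, and the synthetic test tone are all assumptions made for illustration.

```python
import numpy as np


def speed_perturb(signal, factor):
    """Resample `signal` so that it plays `factor` times faster.

    A factor above 1 shortens the signal, a factor below 1 lengthens it.
    Plain resampling changes both duration and pitch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(signal) / factor))
    # Position in the original signal that each output sample reads from.
    src_positions = np.arange(n_out) * factor
    # Linear interpolation; positions past the last sample are clamped.
    return np.interp(src_positions, np.arange(len(signal)), signal)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

# Assumed speed factors; each copy would be added to the training set.
augmented = {f: speed_perturb(tone, f) for f in (0.9, 1.0, 1.1)}

for f, wav in augmented.items():
    print(f, len(wav) / sample_rate)  # duration in seconds
```

Each perturbed copy keeps the original transcript, so the amount of acoustic training data grows without any new annotation. In practice a proper band-limited resampler would normally be preferred over linear interpolation.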
