DNN-Based Multilingual Automatic Speech Recognition for Wolaytta using Oromo Speech

Automatic Speech Recognition (ASR) is useful for human-computer interaction in any human language. However, because it requires a large speech corpus, which is expensive to build, ASR has not been developed for most languages. Multilingual ASR (MLASR) has been proposed as a way to share existing speech corpora among related languages, so that an ASR system can be developed for languages that lack the required speech corpora. The literature shows that phonetic relatedness extends across language families. We have therefore conducted MLASR experiments with two languages from different families: Oromo (Cushitic) as the source and Wolaytta (Omotic) as the target. Using an Oromo Deep Neural Network (DNN) based acoustic model together with a Wolaytta pronunciation dictionary and language model, we achieved a Word Error Rate (WER) of 48.34% for Wolaytta. Moreover, our experiments show that adding only 30 minutes of speech data from the target language (Wolaytta) to the whole training data (22.8 hours) of the source language (Oromo) results in a relative WER reduction of 32.77%. Our results show that, given a pronunciation dictionary and a language model, it is possible to develop an ASR system for a language using an existing speech corpus of another language, irrespective of language family.
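
The two metrics quoted above can be sketched as follows. This is a minimal illustration, assuming the standard definitions: WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and relative WER reduction as the drop in WER divided by the baseline WER. The function names and the toy word strings are hypothetical and are not taken from the paper or its toolkit; the abstract does not state which scoring tool was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy placeholder tokens, not real Wolaytta or Oromo data.
    print(word_error_rate("a b c d", "a x c"))
    print(relative_wer_reduction(50.0, 25.0))
```

Relative reduction is scaled by the baseline, so a 32.77% relative reduction is not the same as a drop of 32.77 percentage points in WER.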
