DNN-Based Speech Recognition for Globalphone Languages

This paper describes new reference benchmark results based on hybrid Hidden Markov Model and Deep Neural Networks (HMM-DNN) for the GlobalPhone (GP) multilingual text and speech database. GP is a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in more than 20 languages. Moreover, we provide new results for five additional languages, namely, Amharic, Oromo, Tigrigna, Wolaytta, and Uyghur. Across the 22 languages considered, the hybrid HMM-DNN models outperform the HMM-GMM based models regardless of the size of the training speech used. Overall, we achieved relative improvements that range from 7.14% to 59.43%.

[1]  Solomon Teferra Abate,et al.  An Amharic speech corpus for large vocabulary continuous speech recognition , 2005, INTERSPEECH.

[2]  G. B. Varile Multilingual Speech Processing , 2005 .

[3]  W S McCulloch,et al.  A logical calculus of the ideas immanent in nervous activity , 1990, The Philosophy of Artificial Intelligence.

[4]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[5]  Tanja Schultz,et al.  TOWARDS RAPID LANGUAGE PORTABILITY OF SPEECH PROCESSING SYSTEMS , 2004 .

[6]  M. Dryer,et al.  The Languages of the World , 1997 .

[7]  Tanja Schultz,et al.  GlobalPhone: Pronunciation Dictionaries in 20 Languages , 2014, LREC.

[8]  Ngoc Thang Vu,et al.  GlobalPhone: A multilingual text & speech database in 20 languages , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Tanja Schultz,et al.  Globalphone: a multilingual speech and text database developed at karlsruhe university , 2002, INTERSPEECH.

[10]  Yiming Wang,et al.  Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[11]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[12]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[13]  Solomon Teferra Abate,et al.  Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo and Wolaytta , 2020, LREC.

[14]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[15]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[16]  Kenneth Katzner,et al.  Languages of the World , 1977 .