Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR

In this paper, we present an analysis of GlobalPhone (GP) and speech corpora of four Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta). The aim of the analysis is to select speech data from GP for the development of a multilingual Automatic Speech Recognition (ASR) system for the Ethiopian languages. To this end, we analyzed the phonetic overlap among the GP and Ethiopian languages. The results show that there is much phonetic overlap among the Ethiopian languages, although they belong to three different language families. Among the GP languages, Turkish, Uyghur and Croatian have much overlap with the Ethiopian languages, whereas Korean has less phonetic overlap with the rest of the languages. We also analyzed the morphological complexity of the GP and Ethiopian languages, as reflected by the type-to-token ratio (TTR) and the out-of-vocabulary (OOV) rate. Both metrics reflect the morphological complexity of the languages. Korean and Amharic were identified as extremely morphologically complex compared to the other languages. Tigrigna, Russian, Turkish and Polish, among others, are also morphologically complex.
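To make the two morphological-complexity metrics concrete, the following is a minimal sketch of how TTR and OOV rate are commonly computed. It is not the authors' implementation: the whitespace tokenization, the function names, the toy English sentences, and the choice to count OOV over running test tokens (rather than over distinct types) are all assumptions made for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        raise ValueError("token list is empty")
    return len(set(tokens)) / len(tokens)


def oov_rate(train_tokens, test_tokens):
    """Fraction of running test tokens that never occur in the training text."""
    if not test_tokens:
        raise ValueError("test token list is empty")
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Toy data (hypothetical, whitespace-tokenized); real corpora would be far larger.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

ttr = type_token_ratio(train)   # 5 distinct forms over 6 running words
oov = oov_rate(train, test)     # "dog" and "rug" are unseen: 2 of 6 tokens

print(f"TTR = {ttr:.3f}, OOV rate = {oov:.3f}")
```

A morphologically rich language produces many distinct surface forms per lemma, so for the same amount of text it tends to yield a higher TTR and a higher OOV rate. Because TTR falls as a text grows, values are only comparable across languages when computed on samples of similar size.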
