Automatic estimation of dialect mixing ratio for dialect speech recognition

This paper proposes methods for determining an appropriate dialect mixing ratio in automatic speech recognition (ASR) of dialect speech. Training a language model on a dialect-mixed corpus has been reported to be effective for handling ASR across various dialects. One reason is the geographical continuity of spoken dialects: we regard a speaker's dialect as a mixture of several dialects. The mixing ratio depends on the speaker and also changes from moment to moment. Recognition accuracy can be improved by supplying a mixing ratio appropriate to the speaker's dialect. Because the mixing ratio is generally unknown, it must be estimated and updated from the input utterances. We examine two methods for updating it from recognition results: one computes the contribution of each dialect to each recognized word, and the other predicts the mixture from the whole recognized sentence using topic modeling. Experimental results show that the mixing ratios estimated by these methods yield higher recognition accuracy than a fixed mixing ratio.

Index Terms: dialect, supervised latent Dirichlet allocation (sLDA), mixing ratio.
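The first update method, which computes the contribution of each dialect to each recognized word, can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes toy unigram language models, a simple linear interpolation of dialect models, and an EM-style re-estimation step, none of which the abstract specifies. The dialect labels, words, probabilities, and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy unigram models, one per dialect (illustrative values only).
DialectModels = Dict[str, Dict[str, float]]


def mixed_prob(word: str, models: DialectModels, ratio: Dict[str, float],
               floor: float = 1e-6) -> float:
    """Probability of a word under the ratio-weighted mixture of dialect models."""
    return sum(ratio[d] * models[d].get(word, floor) for d in models)


def update_ratio(words: List[str], models: DialectModels,
                 ratio: Dict[str, float], floor: float = 1e-6) -> Dict[str, float]:
    """One EM-style update: average each dialect's per-word contribution.

    Assumption: the contribution of a dialect to a word is its share of the
    mixture probability of that word under the current ratio.
    """
    if not words:
        return dict(ratio)  # nothing recognized, keep the current ratio
    totals = {d: 0.0 for d in models}
    for w in words:
        denom = mixed_prob(w, models, ratio, floor)
        for d in models:
            totals[d] += ratio[d] * models[d].get(w, floor) / denom
    return {d: totals[d] / len(words) for d in models}


# Toy example: "dia_word" appears only in dialect_b, "common" in both.
models = {
    "dialect_a": {"std_word": 0.6, "common": 0.4},
    "dialect_b": {"dia_word": 0.6, "common": 0.4},
}
ratio = {"dialect_a": 0.5, "dialect_b": 0.5}
new_ratio = update_ratio(["dia_word", "common"], models, ratio)
print(new_ratio)
```

In this toy case the recognized words shift the ratio toward `dialect_b`, and the updated ratio still sums to one. A real system would use n-gram language models and the recognizer's own scores in place of the unigram tables assumed here.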
