Designing text corpus using phone-error distribution for acoustic modeling

Preparing enough training data for the acoustic models of large vocabulary continuous speech recognition systems is expensive, and the problem is especially serious for resource-deficient languages. We propose an active learning method that reduces the amount of training data required without degrading recognition performance. The method designs a text corpus for read speech collection. First, it estimates the phone-error distribution from a small amount of fully transcribed speech data. Second, it constructs a sentence set whose phone-occurrence distribution is close to the phone-error distribution and collects speech data for that set. It then extends this process to diphones and triphones and collects more speech data. We evaluated the method in simulation experiments using the Corpus of Spontaneous Japanese. It required only 76 h of speech data to achieve a word accuracy of 74.7%, whereas the conventional training method required 152 h of data to reach the same accuracy.
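The sentence-set construction step can be pictured as a greedy search over candidate sentences. The sketch below is illustrative only, since the abstract does not state the distance measure or the search procedure. The Kullback-Leibler divergence, the additive smoothing constant `alpha`, the greedy strategy, and the names `kl_divergence` and `select_sentences` are all assumptions, and the toy error distribution and candidate sentences are made up for the example.

```python
import math
from collections import Counter


def kl_divergence(target, counts, phones, alpha=1e-3):
    """KL(target || smoothed phone-occurrence distribution) over a fixed phone set."""
    # Additive smoothing (an assumption) keeps unseen phones from giving log(0).
    total = sum(counts.get(p, 0) for p in phones) + alpha * len(phones)
    kl = 0.0
    for p in phones:
        t = target.get(p, 0.0)
        if t > 0.0:
            q = (counts.get(p, 0) + alpha) / total
            kl += t * math.log(t / q)
    return kl


def select_sentences(candidates, target, n_select, alpha=1e-3):
    """Greedily pick sentences whose pooled phone counts best match `target`.

    candidates: list of sentences, each a list of phone symbols.
    target: dict mapping each phone to its share of recognition errors.
    """
    phones = sorted(target)
    selected, pooled = [], Counter()
    remaining = list(range(len(candidates)))
    for _ in range(min(n_select, len(remaining))):
        # Choose the sentence that brings the pooled distribution closest to target.
        best = min(
            remaining,
            key=lambda i: kl_divergence(
                target, pooled + Counter(candidates[i]), phones, alpha
            ),
        )
        selected.append(best)
        pooled.update(candidates[best])
        remaining.remove(best)
    return selected, pooled


# Toy data (hypothetical): "k" accounts for most recognition errors.
error_dist = {"a": 0.1, "k": 0.6, "s": 0.3}
candidates = [["a", "a", "a"], ["k", "s", "k"], ["k", "a", "s"], ["s", "s", "a"]]
chosen, pooled = select_sentences(candidates, error_dist, n_select=2)
print(chosen, dict(pooled))
```

Under these assumptions, the same routine would extend to diphones or triphones by replacing each sentence's phone list with its list of diphone or triphone units and supplying the matching error distribution.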
