Submodular data selection with acoustic and phonetic features for automatic speech recognition

In this paper, we propose to use acoustic-feature-based submodular function optimization to select a subset of untranscribed data for manual transcription, and to retrain the initial acoustic model with the additional transcribed data. The acoustic features are obtained from an unsupervised Gaussian mixture model. We also integrate the acoustic features with phonetic features, obtained from an initial ASR system, in the submodular function. Submodular function optimization carries a theoretical near-optimality guarantee: greedy maximization of a monotone submodular function under a cardinality constraint achieves at least a (1 - 1/e) fraction of the optimum. We performed experiments on 1000 hours of Mandarin mobile-phone speech, of which 300 hours were used to train an initial acoustic model. The experimental results show that the acoustic-feature-based approach, which does not rely on an initial ASR system, performs as well as the phonetic-feature-based approach. Moreover, there is a complementary effect between acoustic-feature-based and phonetic-feature-based data selection: the submodular function with the combined features provides a relative 4.8% character error rate (CER) reduction over the corresponding ASR system trained on a randomly selected subset. We also incorporate a desired feature distribution, estimated from a development set, into a generalized submodular function, but the resulting improvement is insignificant.
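To make the selection procedure concrete, below is a minimal sketch of budget-constrained greedy maximization of a facility-location objective, a monotone submodular function widely used for speech data subset selection. The facility-location choice, the cosine similarity over utterance-level features, and all names (`greedy_facility_location`, `similarity`, `budget`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedily select `budget` utterances maximizing the facility-location
    function f(S) = sum_i max_{j in S} similarity[i, j], which is monotone
    submodular; the greedy solution is therefore within (1 - 1/e) of optimal."""
    n = similarity.shape[0]
    selected = []
    # best_cover[i] = similarity between utterance i and its closest selected one
    best_cover = np.zeros(n)
    for _ in range(budget):
        # Marginal gain of adding candidate j: total improvement in coverage
        gains = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never pick the same utterance twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy usage with hypothetical utterance-level features (e.g., GMM posteriorgrams).
rng = np.random.default_rng(0)
feats = rng.random((100, 64))                        # 100 utterances, 64-dim features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T                                # cosine similarity matrix
subset = greedy_facility_location(sim, budget=10)
```

The (1 - 1/e) bound in the docstring is the classical greedy guarantee for monotone submodular maximization under a cardinality constraint, which is the near-optimality result the abstract alludes to.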
