Developing speech recognition systems for corpus indexing under the IARPA Babel program

Automatic speech recognition is a core component of many applications, including keyword search. In this paper we describe experiments on acoustic modeling, language modeling, and decoding for keyword search on a Cantonese conversational telephony corpus collected as part of the IARPA Babel program. We show that acoustic modeling techniques such as the bootstrapped-and-restructured model and deep neural network acoustic model significantly outperform a state-of-the-art baseline GMM/HMM model, in terms of both recognition performance and keyword search performance, with improvements of up to 11% relative character error rate reduction and 31% relative maximum term weighted value improvement. We show that while an interpolated Model M and neural network LM improve recognition performance, they do not improve keyword search results; however, the advanced LM does reduce the size of the keyword search index. Finally, we show that a simple form of automatically adapted keyword search performs 16% better than a preindexed search system, indicating that out-of-vocabulary search is still a challenge.

[1]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[2]  Tara N. Sainath,et al.  Auto-encoder bottleneck features using deep belief networks , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  James Martens,et al.  Deep learning via Hessian-free optimization , 2010, ICML.

[4]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[5]  Ebru Arisoy,et al.  Large Scale Hierarchical Neural Network Language Models , 2012, INTERSPEECH.

[6]  George Saon,et al.  Dynamic network decoding revisited , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[7]  Brian Kingsbury,et al.  Exploiting diversity for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Geoffrey Zweig,et al.  Advances in speech transcription at IBM under the DARPA EARS program , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Bhuvana Ramabhadran,et al.  Effect of pronounciations on OOV queries in spoken term detection , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Bhuvana Ramabhadran,et al.  Query-by-example Spoken Term Detection For OOV terms , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[11]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[12]  Bhuvana Ramabhadran,et al.  Vocabulary independent spoken term detection , 2007, SIGIR.

[13]  Xiaodong Cui,et al.  System combination and score normalization for spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Brian Kingsbury,et al.  The IBM Attila speech recognition toolkit , 2010, 2010 IEEE Spoken Language Technology Workshop.

[15]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16]  Tara N. Sainath,et al.  Making Deep Belief Networks effective for large vocabulary continuous speech recognition , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[17]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[18]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[19]  Tara N. Sainath,et al.  Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[20]  Bhuvana Ramabhadran,et al.  Scaling shrinkage-based language models , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[21]  John R. Hershey,et al.  Hidden Markov Acoustic Modeling With Bootstrap and Restructuring for Low-Resourced Languages , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Hervé Bourlard,et al.  A Continuous Speech Recognition System Embedding MLP into HMM , 1989, NIPS.

[23]  Richard Sproat,et al.  Lattice-Based Search for Spoken Utterance Retrieval , 2004, NAACL.