Towards using hierarchical posteriors for flexible automatic speech recognition systems

Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., ``Tandem'') towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.

[1]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[2]  M. Hunt A statistical approach to metrics for word and syllable recognition , 1979 .

[3]  Pierre A. Devijver,et al.  Baum's forward-backward algorithm revisited , 1985, Pattern Recognit. Lett..

[4]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[5]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[6]  Keinosuke Fukunaga,et al.  Statistical Pattern Recognition , 1993, Handbook of Pattern Recognition and Computer Vision.

[7]  N. Morgan,et al.  A training algorithm for statistical sequence recognition with applications to transition-based speech recognition , 1996, IEEE Signal Processing Letters.

[8]  Hervé Bourlard,et al.  Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems , 1997, EUROSPEECH.

[9]  Nikki Mirghafori,et al.  Combining connectionist multi-band and full-band probability streams for speech recognition of natural numbers , 1998, ICSLP.

[10]  R. Cole,et al.  TELEPHONE SPEECH CORPUS DEVELOPMENT AT CSLU , 1998 .

[11]  Hervé Bourlard,et al.  Confidence Measures in Hybrid HMM/ANN Speech Recognition , 1998 .

[12]  Hynek Hermansky,et al.  TRAPS - classifiers of temporal patterns , 1998, ICSLP.

[13]  Hervé Bourlard,et al.  Improving posterior based confidence measures in hybrid HMM/ANN speech recognition systems , 1998, ICSLP.

[14]  Steve Renals,et al.  Confidence measures from local posterior probability estimates , 1999, Comput. Speech Lang..

[15]  D. Ellis,et al.  CONNECTIONIST FEATURE EXTRACTION FOR CONVENTIONAL HMM SYSTEMS , 1999 .

[16]  Andreas Stolcke,et al.  THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM , 2000 .

[17]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[18]  Hynek Hermansky,et al.  Qualcomm-ICSI-OGI features for ASR , 2002, INTERSPEECH.

[19]  Michael S. Lewicki,et al.  Efficient coding of natural sounds , 2002, Nature Neuroscience.

[20]  Hynek Hermansky,et al.  Hierarchical tandem feature extraction , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21]  Word-Level Confidence Estimation for Automatic Speech Recognition , 2002 .

[22]  Daniel P. W. Ellis,et al.  Error visualization for tandem acoustic modeling on the Aurora task , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[23]  Nelson Morgan,et al.  Learning long-term temporal features in LVCSR using neural networks , 2004, INTERSPEECH.

[24]  Hynek Hermansky,et al.  Entropy based combination of tandem representations for noise robust ASR , 2004, INTERSPEECH.

[25]  Sherif Abdou,et al.  Beam search pruning in speech recognition using a posterior probability-based confidence measure , 2004, Speech Commun..

[26]  Andreas Stolcke,et al.  On using MLP features in LVCSR , 2004, INTERSPEECH.

[27]  Eric Horvitz,et al.  Layered representations for learning and inferring office activity from multiple sensory channels , 2004, Comput. Vis. Image Underst..

[28]  Samy Bengio,et al.  Modeling Individual and Group Actions in Meetings: A Two-Layer HMM Framework , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.