Recovery from Model Inconsistency in Multilingual Speech Recognition: Report from JHU Workshop 2007

Current ASR systems have difficulty handling unexpected words, which are typically replaced by acoustically acceptable words with high prior probability. Identifying the parts of the message where such a replacement could have happened may allow for corrective strategies. We aim to develop data-guided techniques that yield unconstrained estimates of the posterior probabilities of the sub-word classes employed in the stochastic model, derived solely from the acoustic evidence, i.e. without the use of higher-level language constraints. These posterior probabilities could then be compared with the constrained estimates of posterior probabilities derived under the constraints implied by the underlying stochastic model. Parts of the message where a significant mismatch between these two probability distributions is found should be reexamined and corrective strategies applied. This may allow for the development of systems that are able to indicate when they "do not know" and that may eventually be able to "learn as you go" in applications that encounter new situations and new languages. During the 2007 Summer Workshop we intend to focus on the detection and description of out-of-vocabulary and mispronounced words in the six-language CallHome database. Additionally, in order to describe the suspect parts of the message, we will work on a language-independent recognizer of speech sounds that could be applied to the phonetic transcription of the identified suspect parts of the recognized message.
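
The comparison of the two posterior streams can be illustrated with a minimal sketch. It is not the workshop's method: the abstract does not name a mismatch measure, so the per-frame Kullback-Leibler divergence, the moving-average smoothing, the threshold and window values, and all function names below are assumptions chosen purely for illustration. Each frame is taken to be a list of posterior probabilities over the same sub-word classes, one list from the unconstrained (acoustics-only) estimator and one from the constrained (model-based) estimator.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) in nats for two discrete distributions over the same classes.

    eps guards against division by zero; it is an illustrative choice.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def smooth(values, width):
    """Centered moving average; the window is truncated at the sequence edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def suspect_regions(unconstrained, constrained, threshold=1.0, width=3):
    """Return (start, end) frame ranges (end exclusive) whose smoothed
    divergence between the two posterior streams exceeds the threshold.

    threshold and width are hypothetical tuning parameters.
    """
    divergence = [kl_divergence(u, c)
                  for u, c in zip(unconstrained, constrained)]
    smoothed = smooth(divergence, width)
    regions, start = [], None
    for i, d in enumerate(smoothed):
        if d > threshold and start is None:
            start = i                      # a suspect region opens
        elif d <= threshold and start is not None:
            regions.append((start, i))     # the region closes
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions

# Toy data: three sub-word classes, six frames. The two estimators agree
# everywhere except frames 2 and 3.
agree = [0.8, 0.1, 0.1]
unconstrained = [agree, agree, [0.05, 0.9, 0.05], [0.05, 0.9, 0.05], agree, agree]
constrained = [agree, agree, [0.9, 0.05, 0.05], [0.9, 0.05, 0.05], agree, agree]

print(suspect_regions(unconstrained, constrained))  # [(2, 4)]
```

In this toy case the flagged range covers exactly the frames where the acoustics-only posteriors favor a different class than the model-constrained ones, which is the kind of segment the abstract proposes to reexamine and transcribe phonetically.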
