A flexible stream architecture for ASR using articulatory features

Recently, speech recognition systems based on articulatory features such as “voicing” or the position of lips and tongue have gained interest, because they promise advantages in robustness and permit new adaptation methods to compensate for channel, noise, and speaker variability. These approaches are also interesting from a general point of view, because their models rest on phonological and phonetic concepts, which allow a richer description of a speech act than the sequence of HMM states that dominates ASR architectures today. In this work, we present a multi-stream architecture in which CD-HMMs are supported by detectors for articulatory features, combined via a linear combination of log-likelihood scores. This multi-stream approach reduces WER by 15% on a read Broadcast News (BN) task and improves performance on a spontaneous scheduling task (ESST) by 7%. The proposed architecture also opens the door to new speaker and channel adaptation schemes, including stream asynchrony.
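The score combination described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function name, example weights, and scores are hypothetical:

```python
def combined_log_likelihood(stream_log_likelihoods, stream_weights):
    """Linearly combine per-stream log-likelihood scores.

    stream_log_likelihoods: log p(x | state) from each stream, e.g. the
        CD-HMM stream plus one score per articulatory feature detector.
    stream_weights: non-negative stream weights (typically summing to 1).
    """
    assert len(stream_log_likelihoods) == len(stream_weights)
    return sum(w * ll for w, ll in zip(stream_weights, stream_log_likelihoods))

# Hypothetical example: one CD-HMM stream and two feature-detector streams,
# with the HMM stream weighted most heavily.
score = combined_log_likelihood([-12.3, -4.1, -3.8], [0.7, 0.2, 0.1])
```

Because the combination is linear in log space, per-state or per-speaker stream weights can be tuned independently of the underlying acoustic models, which is what makes the adaptation schemes mentioned above possible.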
