ACID/HNN: clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling

We present the ACID/HNN framework, a principled approach to hierarchical connectionist acoustic modeling for large-vocabulary conversational speech recognition (LVCSR). The approach uses an agglomerative clustering algorithm based on information divergence (ACID) to automatically design and robustly estimate hierarchies of neural networks (HNN) for arbitrarily large sets of context-dependent, decision-tree-clustered HMM states. We argue that a hierarchical approach is crucial when applying locally discriminative connectionist models to the very large state spaces typical of LVCSR systems. We evaluate the ACID/HNN framework on the Switchboard conversational telephone speech corpus. We also examine two benefits of the proposed connectionist acoustic model: its hierarchical structure can be exploited for speaker adaptation and for algorithms that speed up decoding.
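To make the clustering idea concrete, the following is a minimal sketch of agglomerative clustering driven by an information divergence. The abstract does not give the density model, the exact divergence, or the merge rule, so everything here is an assumption for illustration: each HMM state is summarized by a single diagonal-covariance Gaussian, the distance is the symmetric Kullback-Leibler divergence, and merged clusters are re-estimated by moment matching. The function names (`sym_kl_diag_gauss`, `merge`, `build_tree`, `leaf_labels`) are hypothetical.

```python
from itertools import combinations


def sym_kl_diag_gauss(m1, v1, m2, v2):
    """Symmetric KL divergence KL(p||q) + KL(q||p) for diagonal Gaussians."""
    d = 0.0
    for a, s, b, t in zip(m1, v1, m2, v2):
        diff2 = (a - b) ** 2
        # The log-variance terms of the two directions cancel in the sum.
        d += 0.5 * (s / t + t / s - 2.0) + 0.5 * diff2 * (1.0 / s + 1.0 / t)
    return d


def merge(c1, c2):
    """Merge two clusters by moment matching (assumed merge rule)."""
    n = c1["n"] + c2["n"]
    mean, var = [], []
    for a, s, b, t in zip(c1["mean"], c1["var"], c2["mean"], c2["var"]):
        m = (c1["n"] * a + c2["n"] * b) / n
        second = (c1["n"] * (s + a * a) + c2["n"] * (t + b * b)) / n
        mean.append(m)
        var.append(second - m * m)
    return {"n": n, "mean": mean, "var": var, "children": (c1, c2), "label": None}


def build_tree(leaves):
    """Greedy bottom-up clustering: always merge the closest pair.

    `leaves` maps a state label to (frame count, mean vector, variance vector).
    Returns the root of a binary tree over all states.
    """
    active = [
        {"n": n, "mean": list(m), "var": list(v), "children": (), "label": lab}
        for lab, (n, m, v) in leaves.items()
    ]
    while len(active) > 1:
        i, j = min(
            combinations(range(len(active)), 2),
            key=lambda p: sym_kl_diag_gauss(
                active[p[0]]["mean"], active[p[0]]["var"],
                active[p[1]]["mean"], active[p[1]]["var"],
            ),
        )
        merged = merge(active[i], active[j])
        active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]
    return active[0]


def leaf_labels(node):
    """Collect the state labels below a tree node."""
    if not node["children"]:
        return [node["label"]]
    return [lab for child in node["children"] for lab in leaf_labels(child)]
```

In a hierarchy of neural networks, each internal node of such a tree would correspond to a small network that discriminates among that node's children, so no single network has to discriminate among all states at once. How the binary tree is turned into the final hierarchy is not described in the abstract and is left out of this sketch.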
