Paper Information - Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
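
The statement that every configuration of the precision units specifies a different precision matrix can be illustrated with a small sketch. It assumes the commonly cited form of the conditional precision, C diag(P h) C^T + I, with the factor-to-hidden weights written as non-negative magnitudes. That sign convention, the toy values, and all names (`precision_matrix`, `C`, `P`, `h`) are assumptions made for illustration, not taken from the abstract. The mean units, which would shift the conditional mean, are left out.

```python
from itertools import product


def precision_matrix(C, p_weights, h):
    """Return C diag(g) C^T + I, where g[f] = sum_k p_weights[f][k] * h[k].

    C: visible-to-factor filters, shape (n_visible, n_factors).
    p_weights: non-negative factor-to-hidden weights (assumed convention),
        shape (n_factors, n_hidden).
    h: binary states of the precision hidden units.
    """
    n_visible, n_factors = len(C), len(C[0])
    # Each factor's gain depends on which precision units are switched on.
    gains = [
        sum(p_weights[f][k] * h[k] for k in range(len(h)))
        for f in range(n_factors)
    ]
    return [
        [
            sum(C[i][f] * gains[f] * C[j][f] for f in range(n_factors))
            + (1.0 if i == j else 0.0)  # identity term keeps the matrix positive definite
            for j in range(n_visible)
        ]
        for i in range(n_visible)
    ]


# Toy model: 2 input dimensions, 2 factors, 2 precision units (made-up values).
C = [[1.0, 1.0], [1.0, -1.0]]
P = [[1.0, 0.0], [0.0, 2.0]]

# Enumerate all hidden configurations and the precision matrix each one selects.
matrices = {h: precision_matrix(C, P, h) for h in product((0, 1), repeat=2)}

for h, m in matrices.items():
    print(h, m)
```

With all precision units off, the matrix reduces to the identity, so the input components are conditionally independent. Any active unit introduces off-diagonal terms, which is the covariance structure a diagonal-covariance model cannot express.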
