Paper Information - Convolutional Neural Networks for Distant Speech Recognition

Convolutional Neural Networks for Distant Speech Recognition

We investigate convolutional neural networks (CNNs) for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). In the MDM case we explore a beamformed signal input representation compared with the direct use of multiple acoustic channels as a parallel input to the CNN. We have explored different weight sharing approaches, and propose a channel-wise convolution with two-way pooling. Our experiments, using the AMI meeting corpus, found that CNNs improve the word error rate (WER) by 6.5% relative compared to conventional deep neural network (DNN) models and 15.7% over a discriminatively trained Gaussian mixture model (GMM) baseline. For cross-channel CNN training, the WER improves by 3.5% relative over the comparable DNN structure. Compared with the best beamformed GMM system, cross-channel convolution reduces the WER by 9.7% relative, and matches the accuracy of a beamformed DNN.
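
To make the idea of channel-wise convolution with two-way pooling more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `channelwise_conv_two_way_pool`, the choice to share one filter bank across all microphone channels, the non-overlapping pooling, and the omission of biases and nonlinearities are all assumptions made for brevity. Each channel is convolved separately along the frequency axis, and the result is then max-pooled in two directions: across channels and across neighbouring frequency bands.

```python
import numpy as np


def channelwise_conv_two_way_pool(x, filters, pool_size):
    """Hypothetical sketch of channel-wise convolution with two-way pooling.

    x        : array of shape (n_channels, n_bands), one row per microphone.
    filters  : array of shape (n_filters, width); ASSUMED shared across channels.
    pool_size: number of adjacent frequency positions pooled together.
    Returns an array of shape (n_filters, n_pooled_bands).
    """
    n_channels, n_bands = x.shape
    n_filters, width = filters.shape
    n_out = n_bands - width + 1  # length of a "valid" convolution output

    # Convolve each microphone channel separately along the frequency axis.
    acts = np.empty((n_channels, n_filters, n_out))
    for c in range(n_channels):
        for f in range(n_filters):
            acts[c, f] = np.correlate(x[c], filters[f], mode="valid")

    # Pooling direction 1: keep the maximum activation across channels.
    cross_channel = acts.max(axis=0)  # shape (n_filters, n_out)

    # Pooling direction 2: non-overlapping max over adjacent frequency positions.
    n_pooled = n_out // pool_size
    trimmed = cross_channel[:, : n_pooled * pool_size]
    return trimmed.reshape(n_filters, n_pooled, pool_size).max(axis=2)


# Toy input: 2 microphone channels, 5 frequency bands, one filter of width 2.
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [5.0, 4.0, 3.0, 2.0, 1.0]])
filters = np.array([[1.0, 1.0]])
pooled = channelwise_conv_two_way_pool(x, filters, pool_size=2)
```

In this toy case the cross-channel maximum selects, at each frequency position, whichever microphone responds most strongly to the filter, which is one intuition for why such pooling might substitute for explicit beamforming.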
