Adaptive windows multiple deep residual networks for speech recognition

Abstract The hybrid convolutional neural network and hidden Markov model (CNN-HMM) has recently achieved strong performance in speech recognition because deep neural networks model complex correlations between features. Automatic speech recognition (ASR) serves as an input to many intelligent and expert systems and affects fields such as search engines (speech-driven search), healthcare (medical reporting by medical personnel and disease-diagnosis expert systems), service delivery, and service-provider communication (establishing callers' demands and directing them to the appropriate operator for assistance). This paper introduces a method that further reduces the recognition error rate. We first propose the adaptive windows convolutional neural network (AWCNN) to analyze joint temporal-spectral feature variation. AWCNN makes the model more robust against both intra- and inter-speaker variations. We further propose a new form of residual learning, which leads to better utilization of information in deep layers and provides better control over the transfer of input information. The proposed speech recognition system can be used as the vocal input for many artificial and expert systems. We evaluated the proposed method on the TIMIT, FARSDAT, Switchboard, and CallHome datasets and on one image database, MNIST. The experimental results show that the proposed method reduces the absolute error rate by 7% compared with state-of-the-art methods in some speech recognition tasks.
