The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices

CHiME-3 is a research community challenge, organised in 2015, that evaluates speech recognition systems for mobile multi-microphone devices used in noisy everyday environments. This paper describes NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques. The newly developed techniques include the use of spectral masks for estimating acoustic beam-steering vectors and acoustic modelling with deep convolutional neural networks based on the "network in network" concept. Beyond these, our system differs from the official baseline system in several key respects, including multi-microphone acoustic model training, dereverberation, and cross adaptation between neural networks with different architectures. We investigate the impact of each of these techniques on recognition performance. By combining them, our system achieves word error rates of 3.45% on the development set and 5.83% on the evaluation set. Three simpler systems are also developed for evaluation under constrained set-ups.
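The mask-based steering vector estimation named above can be summarised as: weight the multichannel STFT by a speech mask, form a per-frequency spatial covariance matrix, and take its principal eigenvector as the steering vector, which then drives a beamformer. The NumPy sketch below illustrates this recipe together with an MVDR beamformer; the function names, array shapes, and the assumption that a speech mask and a noise covariance estimate are already available are illustrative choices, not details taken from the paper.

```python
import numpy as np

def estimate_steering_vectors(stft, speech_mask, eps=1e-8):
    # stft: (n_mics, n_frames, n_freqs) complex STFT of the array signals.
    # speech_mask: (n_frames, n_freqs), values in [0, 1], high where speech dominates.
    n_mics, n_frames, n_freqs = stft.shape
    steering = np.empty((n_freqs, n_mics), dtype=complex)
    for f in range(n_freqs):
        X = stft[:, :, f]                     # (n_mics, n_frames)
        w = speech_mask[:, f]                 # (n_frames,)
        # Mask-weighted spatial covariance: emphasises speech-dominated frames.
        Phi = (X * w) @ X.conj().T / max(w.sum(), eps)
        # The principal eigenvector of Phi approximates the steering vector.
        _, vecs = np.linalg.eigh(Phi)         # eigenvalues sorted ascending
        steering[f] = vecs[:, -1]
    return steering

def mvdr_weights(steering, noise_cov):
    # Classic MVDR solution per frequency bin:
    #   w_f = R_f^{-1} d_f / (d_f^H R_f^{-1} d_f)
    # noise_cov: (n_freqs, n_mics, n_mics) noise spatial covariance estimates.
    n_freqs, n_mics = steering.shape
    weights = np.empty((n_freqs, n_mics), dtype=complex)
    for f in range(n_freqs):
        d = steering[f]
        r = np.linalg.solve(noise_cov[f], d)  # R_f^{-1} d_f
        weights[f] = r / (d.conj() @ r)
    return weights
```

Given such weights, the enhanced single-channel STFT would be obtained as y(t, f) = w_f^H x(t, f), e.g. `np.einsum('fm,mtf->tf', weights.conj(), stft)`.

Similarly, the "network in network" concept replaces a plain convolutional layer with a spatial convolution followed by 1x1 convolutions, which act as a small multilayer perceptron applied at every time-frequency position. The PyTorch sketch below shows the general pattern only; the channel counts, kernel sizes, and depth are hypothetical and are not the architecture used in the NTT system.

```python
import torch.nn as nn

class NiNBlock(nn.Module):
    """A 'network in network' style block: one spatial convolution
    followed by 1x1 convolutions acting as a per-position MLP."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),  # 1x1 "MLP" convolution
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, channels, frequency, time) feature maps.
        return self.layers(x)
```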
